1. Field of the Invention
The embodiments of the invention relate to computer systems. Specifically, the embodiments of the invention relate to the maintenance of program order for the parallel execution of instructions.
2. Background
A central processing unit (CPU) of a computer system may include multiple execution clusters for processing instructions in parallel. Processing instructions in parallel increases the processing speed and efficiency of the computer system. Instructions are retrieved from a memory or storage device to be processed by an execution cluster. The instructions that are retrieved from memory may include instructions that conditionally branch to other sections of a program. The retrieval of instructions may be done in groups of sequential instructions. The retrieval of instructions may include speculative retrieval of instructions based on a ‘guess’ of which path through a set of instructions is likely to be taken when a conditional branch instruction is executed.
The instructions that are retrieved are apportioned to each of the execution clusters. The apportionment of instructions may be based on cluster availability, cluster capabilities, a scheduling algorithm or similar considerations. The instructions may be apportioned to the execution units in sets of sequential instructions or as individual instructions.
The instructions have a sequential order in which they must be processed by the CPU for the proper function of the computer system and applications running on the computer system. Execution clusters may process the instructions out of order. However, the results of the processing of the instructions must then be reordered before they are used to update the architecture of a processor or used to generate signals to other components of a computer system. The instruction order of instructions assigned to execution clusters is tracked in a single global reorder buffer. This global reorder buffer is used by the circuitry that transfers the results of the executed instructions into the overall architecture of the CPU to determine which execution cluster contains the next instruction to be implemented in computer architecture.
Embodiments of the invention are illustrated by way of example and not by way of limitation in the figures of the accompanying drawings in which like references indicate similar elements. It should be noted that references to “an” or “one” embodiment in this disclosure are not necessarily to the same embodiment, and such references mean at least one.
In one embodiment, the apportionment unit of fetch control unit 201 is responsible for assigning instructions retrieved from system memory 105 to available execution clusters. Instructions may be assigned to each execution cluster by a round robin scheme, a resource availability scheme or similar scheme. In one embodiment, instructions are divided into frequently used instructions and infrequently used instructions or sets of instructions. Infrequently used instructions or sets of instructions may be assigned to a first set of execution clusters and frequently used instructions or sets of instructions to a second set of execution clusters. In another embodiment, execution clusters may be optimized to perform discrete tasks such as floating point arithmetic. Fetch control unit 201 may assign instructions to execution units based on the type of computation or processing required by the instructions and the capabilities or specialization of the execution clusters.
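As a minimal sketch of the round robin scheme mentioned above, the following Python fragment assigns each retrieved segment to the next execution cluster in a fixed rotation. The function name `assign_round_robin`, the cluster labels "A" and "B", and the segment numbers are hypothetical and chosen only for illustration; the specification does not prescribe a particular implementation.

```python
from itertools import cycle


def assign_round_robin(segments, cluster_ids):
    """Map each segment to the next cluster in a repeating rotation.

    Illustrative only: a real fetch control unit could equally use
    resource availability or cluster specialization to decide.
    """
    rotation = cycle(cluster_ids)
    return {segment: next(rotation) for segment in segments}


# Hypothetical input: five segments shared between two clusters.
assignment = assign_round_robin([1, 2, 3, 4, 5], ["A", "B"])
```

A resource availability scheme or a capability-based scheme would replace the rotation with a selection that inspects cluster state; the mapping from segment to cluster would otherwise look the same.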
In one embodiment, each execution cluster 203A, 203B includes a local reorder buffer (local ROB) 205A, 205B, queue or similar structure for storing the instructions to be processed by the execution cluster. CPU 101 may contain any number of execution clusters each having a local reorder buffer. In one embodiment, execution cluster 203A includes a local reorder buffer 205A and execution cluster 203B includes a local reorder buffer 205B. In one embodiment, local reorder buffer 205A stores the instructions assigned to the execution cluster 203A by fetch control unit 201. Local reorder buffers may be first-in first-out (FIFO) buffers or similar devices. Each entry in the buffer may correspond to an instruction. Each entry may also track additional data related to each instruction. In one embodiment, local reorder buffer 205A stores tags, pointers or similar data related to an instruction.
In one embodiment, local reorder buffers may also store data related to the status of the instruction or similar data related to the instructions. In one embodiment, the instructions are grouped into discrete sequences of instructions or ‘segments.’ Segments may be delineated based on control transfer instructions (CTIs) such that segments include a set of instructions that are associated with a single CTI or similar arrangement. A CTI may be a conditional branching instruction or similar event. CTIs require fetch control unit 201 to speculate as to how the CTI will be resolved and to predict subsequent instructions to be retrieved based on a ‘guess’ as to which path will be taken after the CTI is resolved. In another embodiment, segments are delineated such that CTIs are the final instructions in the segment. Segments may be atomic blocks of instructions. Atomic blocks of instructions are sets of instructions that can be processed in parallel and that, if predicted by fetch control unit 201 subsequent to a CTI, are discarded as a block if the prediction is inaccurate.
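The embodiment in which each CTI is the final instruction of its segment can be sketched as follows. This is an illustrative sketch only: the function name `split_into_segments`, the predicate `is_cti`, and the instruction mnemonics are assumptions introduced here and do not come from the specification.

```python
def split_into_segments(instructions, is_cti):
    """Group instructions so that every CTI closes its segment."""
    segments, current = [], []
    for instruction in instructions:
        current.append(instruction)
        if is_cti(instruction):
            # The CTI is the last instruction of the segment it belongs to.
            segments.append(current)
            current = []
    if current:
        # Trailing instructions with no CTI yet form an open segment.
        segments.append(current)
    return segments


# Hypothetical mnemonics; "beq" and "bne" stand in for conditional branches.
program = ["add", "load", "beq", "mul", "store", "bne", "sub"]
segments = split_into_segments(program, lambda ins: ins in {"beq", "bne"})
```

Because a segment ends at its CTI, a misprediction of that CTI invalidates only later segments, which is what allows a segment to be discarded as a block.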
In one embodiment, local reorder buffers 205A, 205B are each connected with a global reorder buffer 207. Global reorder buffer 207 tracks the relative order of instructions and segments in each local reorder buffer 205A, 205B. Output or ‘retirement’ device 209 uses the global reorder buffer to determine which execution cluster contains the next instruction or segment to be output to update the architecture of CPU 101 and computer system 100. In one embodiment, retirement unit 209 is a single device that retrieves data from execution clusters 203A, 203B and updates CPU 101 architecture and generates signals to computer system 100 components. In another embodiment, retirement unit 209 is a set of components that implement the update of the architectural state. Global reorder buffer 207 also communicates with local reorder buffers 205A, 205B to update the local buffers when other execution clusters encounter mispredicted instructions. When a mispredicted instruction is encountered, all instructions that were retrieved subsequent to the mispredicted instruction are erased or ‘flushed.’ A new set of instructions is then retrieved based on the actual resolution of the CTI that caused the misprediction. Global reorder buffer 207 works with local reorder buffers 205A and 205B to enforce a hierarchical distributed program reorder mechanism for CPU 101.
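The division of labor between the local and global reorder buffers during retirement can be sketched in Python as follows. This is a simplified model under stated assumptions, not the described hardware: the class and function names are hypothetical, the global buffer is assumed to hold one owner entry per segment, and the distribution of segments 1 to 7 across buffers "A" and "B" is chosen only for illustration.

```python
from collections import deque


class LocalROB:
    """Per-cluster FIFO of segments, oldest first."""

    def __init__(self):
        self.segments = deque()


class GlobalROB:
    """Records only which local ROB holds each successive segment."""

    def __init__(self):
        self.owners = deque()


def assign(segment, rob_name, local_robs, global_rob):
    # The segment goes to the local buffer; the global buffer notes the owner.
    local_robs[rob_name].segments.append(segment)
    global_rob.owners.append(rob_name)


def retire_next(local_robs, global_rob):
    # The global buffer names the owner; that owner's oldest segment retires.
    owner = global_rob.owners.popleft()
    return local_robs[owner].segments.popleft()


local_robs = {"A": LocalROB(), "B": LocalROB()}
global_rob = GlobalROB()
for segment, rob_name in [(1, "A"), (2, "A"), (3, "A"), (4, "B"),
                          (5, "B"), (6, "A"), (7, "A")]:
    assign(segment, rob_name, local_robs, global_rob)

retired = [retire_next(local_robs, global_rob) for _ in range(7)]
```

In this model the global buffer never stores instruction data, only ownership, so the per-instruction bookkeeping stays local to each cluster while program order is still recovered at retirement.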
The hierarchical distributed system provides improved system performance by localizing the relevant sections of the reorder buffer to each execution cluster. This allows the relative program order of instructions to be primarily maintained within an execution cluster. This improves the speed of the processing of instructions because less delay is involved in updating and retrieving data from a local reorder buffer than a global reorder buffer due to the distance involved in communicating with the global reorder buffer. Overall program order is maintained by utilizing a small global reorder buffer to track the relative order of segments assigned to each local reorder buffer. Local reorder buffers and their execution clusters can operate independently because they are not dependent on the global reorder buffer to maintain coherency relative to each execution cluster. Increased instruction level parallelism (ILP) is obtained by increasing the independent operation of the execution clusters. This system is particularly effective in handling execution clusters that perform out of order (OOO) processing of instructions because the system maintains control over the retirement of instructions in program order and flushes out instructions that have been mispredicted even if processed by the execution cluster. Also, the hierarchical distributed system is not complex, allowing power savings and cost effective manufacturing. Similarly, the distributed hierarchical reorder buffer improves the performance of fetch control unit 201 because local reorder buffers notify fetch control unit 201 upon detection of a mispredicted instruction, allowing the faster retrieval of replacement instructions without having to wait until a global reorder buffer is notified of the misprediction of an instruction.
In the example of
In one embodiment, local reorder buffer B 205B notifies via signal 303 global reorder buffer 207 of the misprediction. The notification includes information identifying the segment that contained the misprediction and may include data identifying the local reorder buffer that generated the notification. In another embodiment, the identity of the local reorder buffer is determined based on the signal line that the notification was received on.
In one embodiment, local reorder buffer B 205B notifies fetch control unit 201 of a mispredicted instruction via signal 307. The notification includes data identifying the instruction that had been mispredicted or the segment of the CTI that caused the misprediction. Fetch control unit 201 may utilize this data to fetch the correct instructions subsequent to the mispredicted CTI. In one embodiment, during the same time period that the misprediction of an instruction is detected in local reorder buffer B 205B, retirement unit 209 retrieves the next segment of instructions to be implemented in CPU 101 architecture or to generate signals to computer system 100 components. In one embodiment, retirement unit 209 relies on data stored in the local reorder buffers that tracks switch points in the assignment of segments to local reorder buffers. Retirement unit 209 uses this data to properly determine the order and location of instructions to be retired. In another embodiment, retirement unit 209 receives switch point data from the global reorder buffer 207. In the example, segment 1 is retired via signal 305. A retired local reorder buffer entry is then removed (i.e., deleted) from the local reorder buffer.
In one embodiment, at approximately the same time (e.g., during the same cycle), in this example, fetch control unit 201 operates independently to retrieve replacement segments labeled 6′ and 7′. Fetch control unit 201 assigns these segments via signal 315 to local reorder buffer A 205A. Fetch control unit 201 also notifies global reorder buffer 207 of the assignment of segments 6′ and 7′ to local reorder buffer A 205A via signal 319. In one embodiment, the retrieval and assignment of segments to local reorder buffers is orthogonal to the flushing processes carried out by the local reorder buffers and global reorder buffer 207. Also, occurring independently, retirement unit 209 retrieves via signal 317 the next segment, segment 2, to implement in architecture of CPU 101 or to generate signals to components of computer system 100. Segment 1 has been deleted from local reorder buffer A 205A because it had been processed by retirement unit 209.
In one embodiment, independent of the processing of the remote flush operation by each local reorder buffer, fetch control unit 201 fetches further instructions and assigns these instructions to the local reorder buffers. In this example, segments 6′ and 7′ are loaded into local reorder buffer A 205A. Fetch control unit 201 also retrieves segments 8′ and 9′. These segments are assigned via signal 321 to local reorder buffer B 205B. Fetch control unit 201 also notifies via signal 323 global reorder buffer 207 of the assignment of segments 8′ and 9′ to local reorder buffer B 205B. Global reorder buffer 207 adds an entry indicating a switch point at segment 6′ has been added to local reorder buffer A 205A. Retirement unit 209 processes the next segment via signal 325, which is segment 3.
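One way to picture the switch point entries is as a record made only when consecutive segments are assigned to different local reorder buffers. The Python sketch below assumes that interpretation; the function name `record_switch_points` and the tuple layout are hypothetical, and the input mirrors the assignment of segments 6′ to 9′ in the example.

```python
def record_switch_points(assignments):
    """Return one (segment, local_rob) entry per change of target buffer.

    `assignments` is a program-ordered list of (segment, local_rob) pairs.
    Assumption: a switch point is logged at the first segment sent to a
    buffer that differs from the previous segment's buffer.
    """
    switch_points = []
    current_rob = None
    for segment, local_rob in assignments:
        if local_rob != current_rob:
            switch_points.append((segment, local_rob))
            current_rob = local_rob
    return switch_points


# Segments 6' and 7' go to local ROB A, then 8' and 9' go to local ROB B.
assignments = [("6'", "A"), ("7'", "A"), ("8'", "B"), ("9'", "B")]
switch_points = record_switch_points(assignments)
```

Under this assumption the global buffer grows with the number of switches rather than the number of segments, which is consistent with the description of it as a small structure.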
In one embodiment, the local reorder buffer determines the location of the CTI that was mispredicted (block 405). If the CTI was at the end of a segment then only subsequent segments are affected by the misprediction. The local reorder buffer marks for flushing each segment subsequent to the segment with the CTI that generated the misprediction up to the next switch point. In one embodiment, the local reorder buffers track switch points along with other data regarding the segments. The local reorder buffer then notifies the global reorder buffer of the segment where the CTI is located that generated the misprediction (block 409). If the CTI is not at the end of a segment then the local reorder buffer also marks the entire segment where the CTI was located for flushing, including instructions that may already have been processed by the associated execution cluster, in addition to the subsequent segments until the next switch point (block 407). The local reorder buffer then notifies the global reorder buffer of the segment in which the misprediction occurred (block 409). The flush of marked segments is then initiated and completes when each has been deleted.
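The local marking step can be sketched as follows. This is a minimal sketch under assumptions that are not stated in the specification: each local entry carries a hypothetical `switch_after` flag meaning that program order moves to another local reorder buffer after that segment, and the return value stands in for the notification sent to the global reorder buffer. The names `Entry` and `mark_for_flush` and the sample segments are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    segment: str
    switch_after: bool = False  # assumed flag: order leaves this ROB next
    flush: bool = False


def mark_for_flush(entries, cti_segment, cti_at_end):
    """Mark local segments invalidated by a mispredicted CTI.

    Returns the segment to report to the global reorder buffer.
    """
    index = next(i for i, e in enumerate(entries) if e.segment == cti_segment)
    if not cti_at_end:
        # A CTI in mid-segment invalidates its own segment as well.
        entries[index].flush = True
    # Mark later segments until the next switch point is reached.
    while not entries[index].switch_after and index + 1 < len(entries):
        index += 1
        entries[index].flush = True
    return cti_segment


# Hypothetical local ROB contents: a switch point follows segment 5.
entries = [Entry("4"), Entry("5", switch_after=True), Entry("8"), Entry("9")]
reported = mark_for_flush(entries, "4", cti_at_end=True)
marked = [e.segment for e in entries if e.flush]
```

Segments beyond the switch point are left unmarked here because, in the described system, segments held after a switch point are handled through the global reorder buffer rather than by the local marking step.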
The hierarchical distributed reorder buffer system has improved efficiency in power savings, improved parallelism and speed due to the distributed architecture and localization of reorder tracking data. Tracking of the program instruction order is primarily performed local to each execution cluster, reducing the amount of signaling and power consumed in communicating with the global reorder buffer. Parallelism is improved by the increased independence of the execution clusters in the distributed system. This increased independence improves the throughput of data in the CPU 101 and improves the overall computer system 100 speed.
In another embodiment, the hierarchical distributed reorder buffer system is used in a network device such as a router or similar device to maintain packet order or network data order. Packets, subcomponents of packets, frames and similar networking protocol groupings of data may be tracked by the hierarchical distributed reorder buffer. Networking protocols and packet handling often require the out of order handling of packets or similar data. However, after processing, this data typically needs to be retransmitted in original order. Global reorder buffers are typically used for maintaining the order of network data. The hierarchical distributed reorder buffer is configured to track the order of packets or network data, allowing efficient reordering with improved performance.
In one embodiment, the hierarchical distributed reorder buffer is implemented in software (e.g., microcode or higher level computer languages). The software implementation may also be used to run simulations or emulations of the hierarchical distributed reorder buffer. A software implementation may be stored on a machine readable medium. A “machine readable” medium may include any medium that can store or transfer information. Examples of a machine readable medium include a ROM, a floppy diskette, a CD-ROM, an optical disk, a hard disk, a radio frequency (RF) link, or similar media.
In the foregoing specification, the invention has been described with reference to specific embodiments thereof. It will, however, be evident that various modifications and changes can be made thereto without departing from the broader spirit and scope of the invention as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.