1. Field
This disclosure relates generally to processors and, more specifically, to enforcing global ordering of transaction executions in a computing system.
2. Description
It is common for a multiple processor computing system to have two types of independent interconnects (e.g., buses): one may be used to connect multiple internal cores with their shared cache within a processor (“internal interconnect”), and the other may be used to connect multiple processors (“system interconnect”). When both types of interconnects exist, it is necessary to ensure that a program order is preserved across these two types of interconnects.
Executing a computer program generally results in issuing a series of transactions. The program executes in an order (“program order”) and expects the transactions that it issues to affect the system in the program order. In practice, a computer system may choose to cache memory and re-order certain transactions to achieve efficient operation. In doing so, the computer system needs to ensure that the executing program “sees” the transactions being handled in the program order. In other words, the transactions must have the same effect, as visible to the program, after caching and re-ordering as they would have had without caching or re-ordering.
If there is only one interconnect, a program order can be guaranteed by mechanisms inherent in the interconnect unit. When there are two or more interconnects (e.g., a processor internal bus and a system bus), however, a bridge (e.g., a bus bridge or a caching bridge) may be needed to couple these interconnects. In such cases, a processor's interconnect unit may no longer have sufficient system visibility to ensure a program order on its own because it does not have control over the transaction execution order on a system interconnect. Thus, it is desirable for the bridge to have the ability to enforce global ordering under which each program order may be maintained across multiple interconnects in a multiple processor system.
The features and advantages of the disclosed subject matter will become apparent from the following detailed description of the subject matter in which:
One goal of enforcing global ordering in a computing system with a bridge is to ensure that any program order is preserved across different interconnects. For example, regardless of the system interconnect's ability to re-order transactions to improve operation efficiency, it must be ensured that transactions are processed on the system interconnect in the order they are issued by a processor. One way to ensure a program order across a bridge is to enforce strict ordering, i.e., to serialize transaction completions on the system interconnect. In other words, a preceding transaction must be completed before any transactions following it can be completed. Although the strict ordering approach can ensure a program order, it makes a computing system very inefficient. A principal source of a system's efficient performance is the overlapped/re-ordered operation of its different pieces. Throttling the system interconnect to enforce strict ordering would be extraordinarily wasteful.
According to an embodiment of techniques disclosed in the present application, independent system interconnect operations are allowed to the greatest extent by distinguishing between cases where the order of transactions must be preserved and cases where strict ordering can be relaxed, and by constraining transaction processing only when ordering is required. A bridge that couples two interconnects (e.g., an internal interconnect and a system interconnect) may be utilized to enforce global ordering. The bridge typically handles transactions from two directions: outbound (from an internal interconnect to a system interconnect) and inbound (from a system interconnect to an internal interconnect). From a program correctness standpoint, so long as outbound and inbound transactions retain their system interconnect ordering within their respective groups, it is completely permissible to let inbound transactions pass completions on the path from the system interconnect to the internal interconnect. An Inter-Queue Ordering Mechanism (IQOM) may be used to achieve this purpose.
The IQOM may be located within a bridge that couples an internal interconnect and a system interconnect. The IQOM may comprise three separate queues: an outbound transaction queue (OTQ), an inbound transaction queue (ITQ), and a global ordering queue (GOQ). The OTQ may be used to ensure the strict completion order among outbound transactions. The ITQ may be used to ensure the strict completion order among inbound transactions. The GOQ may be used to enforce a non-uniform relative ordering policy: an inbound transaction that occurs after an outbound transaction on the system interconnect can be delivered by the bridge to the internal interconnect so long as the inbound transaction occurs before the completion of the outbound transaction on the system interconnect.
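The three queues described above can be sketched in software as follows. This is only an illustrative model, not an actual implementation: the class names, field names, and the use of age-stamped entries are assumptions made for exposition.

```python
# Minimal sketch of the three IQOM queues; all names here are illustrative
# assumptions, not taken from any real hardware design.
class Transaction:
    def __init__(self, tid, direction, age):
        self.tid = tid              # transaction identifier
        self.direction = direction  # "outbound" or "inbound"
        self.age = age              # age as observed on the system interconnect
        self.completed = False      # set once completed on the system interconnect

class IQOM:
    def __init__(self):
        self.otq = []  # outbound transactions, in strict completion order
        self.itq = []  # inbound transactions, in strict completion order
        self.goq = []  # all transactions, ordered by system-interconnect age

    def allocate(self, txn):
        # Each transaction enters its directional queue and also the GOQ,
        # which keeps every transaction in system-interconnect age order.
        (self.otq if txn.direction == "outbound" else self.itq).append(txn)
        self.goq.append(txn)
        self.goq.sort(key=lambda t: t.age)
```

In this model, the OTQ and ITQ each preserve strict order within one direction, while the GOQ interleaves both directions by age so the relative ordering policy can be applied across them.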
Reference in the specification to “one embodiment” or “an embodiment” of the disclosed subject matter means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the disclosed subject matter. Thus, the appearances of the phrase “in one embodiment” in various places throughout the specification are not necessarily all referring to the same embodiment.
A caching bridge in a processor is responsible for receiving transactions from processing cores, looking up the shared cache, and forwarding requests to the system interconnect if needed. It is also responsible for issuing incoming snooping transactions from the system interconnect to the appropriate core or cores inside the processor, delivering results from the system interconnect to the cores, and updating the state of lines in the shared cache. The caching bridge may also enforce global ordering between a system interconnect and an internal interconnect.
A chipset 310 may connect two or more different groups together through connection 330. Chipset 310 may also help couple a graphics circuit, I/O devices, and/or other peripherals to processors (e.g., 350A, 360N, etc.). Chipset 310 may include a caching bridge 320 to couple group interconnects (e.g., 340A, and 340L) together. Caching bridge 320 may help enforce global ordering of transactions among group interconnects (e.g., 340A, . . . , 340L). In one embodiment, caching bridge 320 may be physically inside chipset 310. In another embodiment, caching bridge 320 may be coupled with chipset 310 but not physically inside the chipset. Different groups of agents may also be connected to each other through other devices including networking devices.
If an agent in system 300 is a processor, the processor may be a single-core or multi-core processor. A multi-core processor may have its own caching bridge to couple its own internal interconnect with the group interconnect or directly with other agents through caching bridge 320 in chipset 310. A caching bridge within a multi-core processor may coordinate with caching bridge 320 to enforce global ordering of transactions between a multi-core processor's own internal interconnect and group interconnects. Although not shown in
Caching bridge 400 may also include scheduling and ordering logic 440. The scheduling and ordering logic may maintain the coherency of the cache lines present in shared cache 450. The scheduling and ordering logic schedules requests from cores to the shared cache and the system interconnect so that each core receives a fair share of resources in the caching bridge. A caching bridge typically handles transactions from two directions: outbound (from an internal interconnect to a system interconnect) and inbound (from a system interconnect to an internal interconnect). Inbound transactions are used to maintain system-level cache coherency and are often referred to as snooping transactions. Snooping transactions may remove cache lines (also known as invalidation) from the shared cache when another agent requires exclusive ownership, generally so that the snoop originator can obtain write privileges. Snooping transactions may also demote cache line access rights from ‘exclusive’ to ‘shared’ so that the snoop originator can read the line without necessarily removing it from other agents. Outbound transactions form the conjugate of snooping transactions: when a core wants write permission, it issues a read that invalidates the line in other cores and other cache hierarchies. A simple core line read appears to other agents as a snoop that allows them to retain the cache line in the ‘shared’ state. Note that not all read transactions or snoops have to be sent to the system interconnect. For example, if a cache line to be read can be found, in a sufficient state, in a cache shared by different processing cores inside a processor, the read transaction does not need to be sent out to the system interconnect and accordingly there is no snoop corresponding to this read transaction on the system interconnect.
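The two snoop effects just described, invalidation for exclusive-ownership requests and demotion to ‘shared’ for plain reads, can be illustrated with a small sketch. The function and state names are assumptions made for illustration; real coherence protocols track more states and transitions than shown here.

```python
# Illustrative sketch (assumed names) of the two snoop effects described:
# invalidation when the snoop originator wants exclusive ownership, and
# demotion to 'shared' when it only wants to read the line.
def handle_snoop(cache, line, wants_exclusive):
    """cache maps a line address to its state, 'exclusive' or 'shared'."""
    if line not in cache:
        return                        # line not present; nothing to do
    if wants_exclusive:
        del cache[line]               # invalidate: remove the cache line
    elif cache[line] == "exclusive":
        cache[line] = "shared"        # demote access rights to 'shared'
```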
Scheduling and ordering logic 440 may ensure that inbound transactions received from the system interconnect are sent to appropriate core(s), and eventually deliver the correct results and data to the requesting core. An outbound transaction (e.g., a core's request for data) may be deferred by the scheduling and ordering logic (for example, the requested data is not present in the shared cache or it is present but also owned by other agents in the system). No particular order of completion is guaranteed for deferred transactions. In other words, the transaction ordering observed on the system interconnect may be quite different from the transaction ordering observed by cores. To preserve program orders, however, caching bridge 400, particularly, scheduling and ordering logic 440, needs to enforce global ordering, i.e., to ensure correct program orders expected by program-hosting cores between internal interconnects (e.g., 420) and system interconnect 410. A caching bridge typically enforces global ordering in a multi-processor system at a transaction level, independent of the underlying physical, link or transport layers used to communicate the transactions.
Although the IQOM is illustrated through a caching bridge in an MCMP system in
Any outbound transaction from a processing unit (e.g., a core, a single-core processor, an I/O device, a network controller, etc.) is allocated into OTQ 610 of a caching bridge associated with the processing unit with an indication of its age. An OTQ selector 640 may be used to select the oldest outbound transaction in the OTQ. Any inbound transaction from the system interconnect to the processing unit is allocated into ITQ 630 of the caching bridge with an indication of its age. An ITQ selector 660 may be used to select the oldest inbound transaction in the ITQ. All of the inbound and outbound transactions may be allocated into GOQ 620 of the caching bridge with an indication of their ages as observed on the system interconnect. The IQOM is capable of tracking, through the system interconnect, whether an outbound transaction in the GOQ is completed on the system interconnect and is ready to be delivered to the issuing processing unit. A GOQ selector may be used to select the oldest transaction in the GOQ among all the inbound transactions and all the outbound transactions that are ready to be delivered to the issuing processing unit (“completion transactions”).
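The selectors described above can be sketched as simple age-based picks. The dict-entry representation and function names below are assumptions for illustration only.

```python
# Hypothetical selector sketches; queue entries are dicts with 'age',
# 'direction', and 'completed' keys (representation assumed for exposition).
def select_oldest(queue):
    """OTQ/ITQ selector: pick the entry with the smallest age, if any."""
    return min(queue, key=lambda t: t["age"], default=None)

def select_goq(goq):
    """GOQ selector: pick the oldest among the inbound transactions and
    the completed outbound transactions ('completion transactions')."""
    ready = [t for t in goq
             if t["direction"] == "inbound" or t["completed"]]
    return min(ready, key=lambda t: t["age"], default=None)
```

Note that an outbound transaction that is still pending on the system interconnect is invisible to the GOQ selector, which is what allows younger inbound transactions to be chosen ahead of it.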
The IQOM may also comprise a controller 670 to determine which transaction among those selected by the ITQ, OTQ, and GOQ selectors should be selected and delivered to a corresponding processing unit for processing. At any one time, the controller may have a choice between the oldest inbound transaction (if any) and the oldest completion transaction (if any). Three rules may be used to select a queue whose top (oldest) entry will be issued to a processing unit at a processing unit issue point:
(1) If there is a completion transaction, which is the oldest in the GOQ, ready for processing, and an inbound transaction (if any) appears on the system interconnect after the completion transaction, then select the completion transaction for processing by the processing unit;
(2) If there is no completion transaction ready for processing and there is an inbound transaction ready, which is the oldest in the ITQ, then select the inbound transaction for processing by the processing unit; and
(3) If neither rule (1) nor rule (2) results in a selection, then wait until the next processing unit issue point to try again.
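The three rules above can be condensed into a controller-selection function. This is a sketch under assumed representations (dict entries with ‘direction’ and ‘completed’ keys), not a definitive implementation.

```python
def controller_select(goq_oldest, itq_oldest):
    """Apply the three selection rules at a processing-unit issue point.
    Arguments are the oldest GOQ entry and the oldest ITQ entry, or None
    when the corresponding queue is empty; the key names are assumed."""
    # Rule 1: the oldest GOQ entry is a completion transaction ready for
    # processing; any inbound transaction necessarily appeared after it.
    if (goq_oldest is not None and goq_oldest["direction"] == "outbound"
            and goq_oldest["completed"]):
        return goq_oldest
    # Rule 2: no completion transaction is ready; issue the oldest inbound.
    if itq_oldest is not None:
        return itq_oldest
    # Rule 3: nothing selectable; wait until the next issue point.
    return None
```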
There may be a variety of extensions to this basic framework for selecting a transaction at a processing unit issue point. For example, some FSBs include the ability to defer a transaction. When that happens, the entry corresponding to the deferred transaction in a queue (ITQ or OTQ) is transferred to a defer pool. At a later point, when the deferred transaction is completed on the system bus, that deferred entry is transferred back to its corresponding queue. In some cases, additional rules may be needed to select a transaction at a processing unit issue point. For example, when additional sub-interconnects are used for completing a transaction, specific rules about relative ordering between all interconnects need to be established.
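The defer-pool extension can be sketched as a pair of moves between a queue and the pool, again under the assumed dict-entry representation; the function names are illustrative.

```python
# Sketch of the defer-pool extension described above; names are assumed.
def defer(queue, txn, defer_pool):
    """Move a deferred transaction out of its queue (ITQ or OTQ)."""
    queue.remove(txn)
    defer_pool.append(txn)

def complete_deferred(queue, txn, defer_pool):
    """When the deferral completes on the system bus, return the entry
    to its queue, keeping the queue ordered by age."""
    defer_pool.remove(txn)
    queue.append(txn)
    queue.sort(key=lambda t: t["age"])
```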
In this particular example as shown in
At block 825, the oldest transaction identified at block 820 may be checked to determine if it is from the OTQ. This may be performed by identifying the oldest transaction in the OTQ and checking if this transaction is the same as the oldest transaction from the GOQ. If they are the same, the oldest transaction from the GOQ is from the OTQ. Then, the transaction is further checked at block 830 to determine if it is ready to be delivered to a processing unit for processing. If the oldest transaction from the GOQ is not from the OTQ (i.e., it is from the ITQ), it may be delivered to a corresponding processing unit for processing at block 845. If at block 830, it is determined that the oldest transaction in the GOQ, which is from the OTQ, is ready for processing, the transaction may be delivered to a corresponding processing unit for processing at block 845; otherwise, the ITQ is checked at block 835 to determine if there is any transaction in it. If the ITQ is empty, the caching bridge waits until the next processing unit issue point at block 815 and then performs the processing in blocks 820-835 again. If the ITQ is not empty, the oldest transaction in the ITQ may be identified at block 840. The identified transaction may be selected and delivered to a corresponding processing unit for processing at block 845.
After the selected transaction at block 845 is delivered to a corresponding processing unit for processing, the transaction may be de-allocated from the GOQ at block 850. Then the process from block 810 until block 850 may be re-iterated so that global ordering may be enforced so long as the multi-processor system is running.
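A single pass through this flow (blocks 815 through 850) can be sketched as one function over plain lists of age-stamped entries. The representation is assumed for illustration; a hardware bridge would of course realize these steps in logic rather than software.

```python
# Hypothetical walk-through of blocks 815-850 using lists of age-stamped
# dict entries; all names and representations here are assumed.
def issue_point(otq, itq, goq):
    """One processing-unit issue point: deliver one transaction (removing
    it from the GOQ and its home queue) or return None to wait."""
    if not goq:
        return None                                    # nothing allocated
    oldest = min(goq, key=lambda t: t["age"])          # block 820
    from_otq = bool(otq) and oldest is min(otq, key=lambda t: t["age"])  # 825
    if from_otq and not oldest["completed"]:           # block 830: not ready
        if not itq:                                    # block 835: ITQ empty
            return None                                # wait (back to 815)
        oldest = min(itq, key=lambda t: t["age"])      # block 840
    # Block 845: deliver; block 850: de-allocate from the GOQ.
    goq.remove(oldest)
    (otq if oldest in otq else itq).remove(oldest)
    return oldest
```

Calling this function repeatedly models the re-iteration from block 810 onward: an inbound transaction may be delivered past a pending outbound transaction, but never past a completion transaction that is older than it in the GOQ.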
Although an example embodiment of the disclosed subject matter is described with reference to block and flow diagrams in
In the preceding description, various aspects of the disclosed subject matter have been described. For purposes of explanation, specific numbers, systems and configurations were set forth in order to provide a thorough understanding of the subject matter. However, it is apparent to one skilled in the art having the benefit of this disclosure that the subject matter may be practiced without the specific details. In other instances, well-known features, components, or modules were omitted, simplified, combined, or split in order not to obscure the disclosed subject matter.
Various embodiments of the disclosed subject matter may be implemented in hardware, firmware, software, or combination thereof, and may be described by reference to or in conjunction with program code, such as instructions, functions, procedures, data structures, logic, application programs, design representations or formats for simulation, emulation, and fabrication of a design, which when accessed by a machine results in the machine performing tasks, defining abstract data types or low-level hardware contexts, or producing a result.
For simulations, program code may represent hardware using a hardware description language or another functional description language which essentially provides a model of how designed hardware is expected to perform. Program code may be assembly or machine language, or data that may be compiled and/or interpreted. Furthermore, it is common in the art to speak of software, in one form or another as taking an action or causing a result. Such expressions are merely a shorthand way of stating execution of program code by a processing system which causes a processor to perform an action or produce a result.
Program code may be stored in, for example, volatile and/or non-volatile memory, such as storage devices and/or an associated machine readable or machine accessible medium including solid-state memory, hard-drives, floppy-disks, optical storage, tapes, flash memory, memory sticks, digital video disks, digital versatile discs (DVDs), etc., as well as more exotic mediums such as machine-accessible biological state preserving storage. A machine readable medium may include any mechanism for storing, transmitting, or receiving information in a form readable by a machine, and the medium may include a tangible medium through which electrical, optical, acoustical or other form of propagated signals or carrier wave encoding the program code may pass, such as antennas, optical fibers, communications interfaces, etc. Program code may be transmitted in the form of packets, serial data, parallel data, propagated signals, etc., and may be used in a compressed or encrypted format.
Program code may be implemented in programs executing on programmable machines such as mobile or stationary computers, personal digital assistants, set top boxes, cellular telephones and pagers, and other electronic devices, each including a processor, volatile and/or non-volatile memory readable by the processor, at least one input device and/or one or more output devices. Program code may be applied to the data entered using the input device to perform the described embodiments and to generate output information. The output information may be applied to one or more output devices. One of ordinary skill in the art may appreciate that embodiments of the disclosed subject matter can be practiced with various computer system configurations, including multiprocessor or multiple-core processor systems, minicomputers, mainframe computers, as well as pervasive or miniature computers or processors that may be embedded into virtually any device. Embodiments of the disclosed subject matter can also be practiced in distributed computing environments where tasks may be performed by remote processing devices that are linked through a communications network.
Although operations may be described as a sequential process, some of the operations may in fact be performed in parallel, concurrently, and/or in a distributed environment, and with program code stored locally and/or remotely for access by single or multi-processor machines. In addition, in some embodiments the order of operations may be rearranged without departing from the scope of the disclosed subject matter. Program code may be used by or in conjunction with embedded controllers.
While the disclosed subject matter has been described with reference to illustrative embodiments, this description is not intended to be construed in a limiting sense. Various modifications of the illustrative embodiments, as well as other embodiments of the subject matter, which are apparent to persons skilled in the art to which the disclosed subject matter pertains are deemed to lie within the scope of the disclosed subject matter.