Embodiments of the invention relate generally to the field of partitioned multiple-processor systems, and more specifically to methods for effecting the partitioning of such systems.
Increasing data processing requirements have led to the development of larger and more complicated applications. Multiple-processor systems (MPSs) have been developed to execute such applications more quickly and efficiently.
A typical MPS may be implemented using a bus-based interconnection scheme.
To address these disadvantages, MPSs having a point-to-point, link-based interconnection scheme have been developed. Each node of such a system includes an agent (e.g., processor, memory controller, I/O hub component, chipsets, etc.) and a router for communicating messages between connected nodes. Each node may be directly connected to only a subset of the other nodes of the system. Typically such systems have a single manager for the entire system, but allow partitioning of the resources into logically independent systems, so that, for example, for an eight-processor MPS, two processors may be used for a first application, two others may be used for a second application, and the remaining four may be used for a third application.
Such systems provide improved performance, scalability, and reliability, but at the expense of a more complicated interconnect management protocol. That is, because there are multiple processors acting independently, synchronization is more complicated than in the bus-based scheme, which has a single point of synchronization. While overcoming many of the disadvantages of a bus-based scheme, the link-based implementation presents its own drawbacks.
For a system topology providing a high degree of flexibility (flexible route through), the addition or removal of a node from a partition requires the entire system to be quiesced. The time required to quiesce the entire system should optimally be as small as possible so as not to adversely affect system timeouts.
To avoid having to quiesce the entire MPS, the system topology may be constrained such that communications between agents of a given partition are not routed through agents of a different partition.
The invention may be best understood by referring to the following description and accompanying drawings that are used to illustrate embodiments of the invention.
In the following description, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In other instances, well-known circuits, structures and techniques have not been shown in detail in order not to obscure the understanding of this description.
Reference throughout the specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases “in one embodiment” or “in an embodiment” in various places throughout the specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Moreover, inventive aspects lie in less than all features of a single disclosed embodiment. Thus, the claims following the Detailed Description are hereby expressly incorporated into this Detailed Description, with each claim standing on its own as a separate embodiment of this invention.
Typically, the routing of messages (e.g., packets) in an MPS implemented using a point-to-point interconnection scheme is effected through the use of routing tables. In such networks, messages proceed from a source node, through zero or more intermediate nodes, to a destination node. Each message contains an associated destination, and when a message is received at an intermediate node, the routing algorithm references the routing table to determine the next link over which to route the message. In accordance with one embodiment of the invention, both a primary routing table (PRT) and an alternate routing table (ART) are created and programmed for each agent. The PRT is the routing table used during normal operation of the MPS, while the ART is used upon the occurrence of a dynamic partitioning event or on-line event (OLE). An OLE is the addition or removal of a node from a partition. The occurrence of an OLE results in a change in the system topology. If a node is deleted, some routing paths no longer exist, since the node and its associated router are removed from the system. Likewise, the addition of a node results in the availability of additional routing paths. When an OLE occurs, routing is switched from the PRT to the ART; the ART then becomes the PRT.
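By way of illustration only, the relationship between the two tables can be sketched as follows. The `Router` class, its method names, and the link labels are hypothetical and do not appear in the description; the sketch assumes each table simply maps a destination to an outgoing link.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Router:
    """Hypothetical per-node router with a primary and an alternate table."""
    prt: Dict[str, str]                                  # destination -> link, in use
    art: Dict[str, str] = field(default_factory=dict)    # destination -> link, staged

    def next_link(self, destination: str) -> str:
        # Normal operation: only the PRT is consulted.
        return self.prt[destination]

    def load_art(self, table: Dict[str, str]) -> None:
        # Stage the post-OLE routes without disturbing current routing.
        self.art = dict(table)

    def switch_tables(self) -> None:
        # The ART becomes the PRT; the old PRT is retained as the ART.
        self.prt, self.art = self.art, self.prt
```

In this sketch, a call to `load_art` has no effect on `next_link` until `switch_tables` is invoked, which mirrors the staging of the ART before it takes over as the PRT.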
At operation 310, the nodes of the subject partition, as well as the nodes of any affected partitions, are determined by the management application (i.e., the management application detects the nodes impacted by the requested OLE). For one embodiment, the management application is implemented in firmware. For one embodiment, affected partitions include those having nodes for which a removed node acted as a route-through component (in the case of an on-line node removal), and partitions having nodes that may be used to communicate messages routed along newly established routing paths (in the case of an on-line node addition). In general, affected partitions include the subject partition and are defined as those partitions for which the occurrence of an OLE results in an alteration of the routing path for any source-destination pair within the partition. It may be that less than all of the partitions of the MPS are affected by the OLE.
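The general definition of an affected partition can be illustrated with a minimal sketch, assuming the routing state before and after the OLE is available as per-node next-hop tables. The names `trace`, `affected_partitions`, and the `Tables` layout are hypothetical and chosen only for this example.

```python
from typing import Dict, List, Optional, Set

Tables = Dict[str, Dict[str, str]]  # node -> {destination -> next node}


def trace(tables: Tables, src: str, dst: str) -> Optional[List[str]]:
    """Follow next-hop entries from src to dst; None if no route exists."""
    path, node = [src], src
    while node != dst:
        node = tables.get(node, {}).get(dst)
        if node is None or node in path:  # missing entry or routing loop
            return None
        path.append(node)
    return path


def affected_partitions(partitions: Dict[str, Set[str]], old: Tables,
                        new: Tables, subject: str) -> Set[str]:
    """Subject partition plus every partition with a changed internal route."""
    affected = {subject}
    for name, members in partitions.items():
        for src in members:
            for dst in members:
                if src != dst and trace(old, src, dst) != trace(new, src, dst):
                    affected.add(name)
    return affected
```

For example, if node C of one partition served as a route-through for traffic between nodes A and B of another partition, removing C changes the A-to-B path, so both partitions are reported, while a partition whose internal routes are untouched is not.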
At operation 315, all of the source nodes of the subject partition and affected partitions are quiesced. A partition is quiesced when each node of the partition ceases issuing transactions, a transaction being defined as a message that is observable on the external link connecting two nodes. A quiesced partition resumes issuing transactions when subsequently directed to do so by the management application. The source nodes include nodes having agents that generate transactions, such as, for example, a processor or an I/O agent. For one embodiment, the quiescing of the source nodes is effected by execution of a specific transaction communicated by the management application. For an alternative embodiment, the quiescing of the source nodes is effected by a central agent setting a flag at each of the source nodes. For one embodiment of the invention, each source node is quiesced in a parallel manner. For example, each node receives and examines the quiescing transaction from the management application, and ceases communication of transactions. Each node then awaits completion of all previously communicated request transactions, at which time the node agent indicates that quiescing is complete.
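The quiescing behavior of a single source node might be modeled as shown below. This is a sketch under simplifying assumptions: the `SourceNode` class and its methods are hypothetical, transactions are identified by integers, and completion is signaled by an explicit call rather than by a message on a link.

```python
class SourceNode:
    """Hypothetical source node that can be quiesced and resumed."""

    def __init__(self, name):
        self.name = name
        self.issuing = True
        self.outstanding = set()  # ids of requests awaiting completion

    def issue(self, txn_id):
        # A quiescing or quiesced node refuses to issue new transactions.
        if not self.issuing:
            return False
        self.outstanding.add(txn_id)
        return True

    def complete(self, txn_id):
        # Called when a previously issued request has completed.
        self.outstanding.discard(txn_id)

    def begin_quiesce(self):
        # Stand-in for receiving the quiescing transaction or flag.
        self.issuing = False

    def is_quiesced(self):
        # Quiescing is complete only once all earlier requests have drained.
        return not self.issuing and not self.outstanding

    def resume(self):
        # Stand-in for the management application directing the node to resume.
        self.issuing = True
```

Because each node tracks only its own outstanding requests, a set of such nodes can be quiesced in parallel, with the management application polling `is_quiesced` on each.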
At operation 320, which is performed concurrently with the quiescing of the source nodes, the management application begins loading the ART for each determined node, which also includes the routing tables at each link of an intermediate router. In an alternative embodiment, the intermediate router is not associated with a particular node. To avoid deadlock, the node agents do not begin using the ART until quiescing of all source nodes of the subject and affected partitions is complete.
At operation 325, upon completion of the quiescing, the management application communicates a specific transaction to each of the determined node agents directing the node agents to begin using the ART. For one embodiment, the management application sets an indicator in each quiesced node agent resulting in the quiesced nodes resuming their normal operation using the ART. At this point, the OLE request can be granted.
At operation 330, the management application communicates a message to each source node directing the source node to leave quiescence and resume normal operation with the ART now labeled as the PRT.
At operation 335, the original PRT is redesignated to be the ART in anticipation of a subsequent OLE and the management application is informed that the MPS is ready to receive a subsequent OLE request.
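Operations 315 through 335 can be summarized in a short sketch. The `Node` class and `handle_ole` function are hypothetical; quiescing is modeled as an immediate flag change, and the concurrent loading of operation 320 is performed sequentially for simplicity.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    name: str
    prt: Dict[str, str]
    art: Dict[str, str] = field(default_factory=dict)
    quiesced: bool = False


def handle_ole(nodes: List[Node], new_tables: Dict[str, Dict[str, str]]) -> List[str]:
    """Quiesce, stage the ART, switch tables, and resume; returns a step log."""
    log = []
    # Operations 315 and 320: quiesce each node and stage its ART.
    # (A real system would wait for quiescing to complete asynchronously.)
    for n in nodes:
        n.quiesced = True
        n.art = dict(new_tables[n.name])
    log.append("quiesced+loaded")
    # Deadlock guard: no node may use the ART until all nodes are quiesced.
    if not all(n.quiesced for n in nodes):
        raise RuntimeError("cannot switch tables before quiescing completes")
    # Operations 325 and 335: the ART becomes the PRT and the old PRT
    # is redesignated as the ART for a subsequent OLE.
    for n in nodes:
        n.prt, n.art = n.art, n.prt
    log.append("switched")
    # Operation 330: nodes leave quiescence and resume normal operation.
    for n in nodes:
        n.quiesced = False
    log.append("resumed")
    return log
```

The explicit guard before the swap reflects the stated deadlock-avoidance condition; in this simplified model it always passes, but it marks where a real implementation would block.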
As noted above, the complexity of the routing algorithm is increased due to the manner in which the PRT is overwritten with the ART at each node. For example, because the PRTs of the nodes in the subject partition and any affected partitions are removed as the update progresses, and the ARTs are as yet inactive, it may not be possible to establish a route to a source agent unless updating is effected in a specific order. In accordance with one embodiment, the management application establishes a linear order among all of the node agents in the subject partition and any affected partitions. The PRT of each node is then overwritten (updated) with the ART in the order established, beginning with the farthest and ending with the closest. In this way, the system does not attempt to communicate completion messages sent by a quiesced node along routes where the PRT can no longer be used.
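One way such a linear order could be derived is sketched below. The description does not state the reference point from which "farthest" and "closest" are measured; this example assumes hop distance from the node hosting the management application, computed by breadth-first search. The function name `update_order` and the adjacency-list layout are hypothetical, and all listed nodes are assumed to be reachable from the reference node.

```python
from collections import deque


def update_order(adjacency, reference, nodes):
    """Return nodes sorted farthest-first by hop distance from reference."""
    dist = {reference: 0}
    queue = deque([reference])
    while queue:
        cur = queue.popleft()
        for neighbor in adjacency.get(cur, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[cur] + 1
                queue.append(neighbor)
    # Farthest nodes first; ties are broken by name so the order is linear.
    return sorted(nodes, key=lambda n: (-dist[n], n))
```

For a chain M, A, B, C with the management application at M, the resulting order is C, then B, then A, so a node's table is rewritten only after every node beyond it has been updated.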
Multiple Virtual Network Embodiments
A virtual network (VN) is a set of virtual channels along which any transaction, from a node, can be communicated. One or more VNs may be necessary for deadlock-free routing depending on the system topology. That is, for systems that support multiple VNs, routing algorithms are possible that permit more complex system topologies. For example, ring-based topologies, which reduce average routing distance, and hence, average routing time, require at least two VNs.
For embodiments of the invention described above, the same VN is used for both the PRT and the ART, and it is assumed that one virtual network is sufficient to provide deadlock-free routing for routing algorithms induced by both the PRT and the ART.
Alternative embodiments of the invention may be implemented on systems that support multiple VNs of which at least one VN is not required to support the system topology. For such embodiments, it is possible to effect dynamic partitioning/repartitioning, without quiescing the affected partitions, by restricting routing to less than all of the VNs and then upon notification of an OLE request, switching the routing to an unused VN.
At operation 510, an OLE request is received. The OLE request is received in response to an OLE, which may be an on-line deletion of a node or an on-line addition of a node.
At operation 515, the nodes of the subject partition, as well as the nodes of any affected partitions, are determined by the management application.
At operation 520, an ART, specific to a VN not being employed for PRT routing (e.g., VN1), is loaded for each determined node, which also includes the routing tables at each link of an intermediate router. At this point, all of the traffic in the one or more VNs employed for PRT routing continues as usual.
At operation 525, the management application communicates a specific transaction to each of the source node agents directing the node agents to begin using the ART. For one embodiment, the management application sets a control and status register addressed in the configuration space of each respective node agent. At this point, the OLE request can be granted.
At operation 530, upon directing all source node agents to begin using the ART, the management application verifies that all determined nodes are using the ARTs and that the PRTs are no longer in use. The subject partition can then be quiesced with respect to the VN providing PRT routing (e.g., VN0). For one embodiment, the verification that all determined nodes are using the ARTs and that the PRTs are no longer in use can be effected by the management application issuing a specific transaction (e.g., a “Synch” transaction) to each of the source nodes. Receipt of an acknowledgment to this transaction from each determined node verifies that all determined nodes are using the ARTs and that the PRTs are no longer in use. In an alternative embodiment, the verification may be effected by a central agent resetting a flag at each of the source nodes. For another alternative embodiment, verification can be effected by the management application waiting for a time period equal to at least the longest transaction lifetime for the MPS. The time period is used to determine when a subsequent OLE request can be granted, and is therefore quite flexible.
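Operations 520 through 530 can be illustrated with a minimal sketch in which each node keeps one routing table per VN. The `VnNode` class, the `switch_vn` function, and the `sync` method (a stand-in for acknowledging a "Synch" transaction) are hypothetical and are not drawn from the description.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class VnNode:
    name: str
    tables: Dict[int, Dict[str, str]]  # VN id -> {destination -> link}
    active_vn: int = 0

    def next_hop(self, dst: str) -> Tuple[int, str]:
        # Route using the table of whichever VN is currently active.
        return self.active_vn, self.tables[self.active_vn][dst]

    def sync(self) -> int:
        # Stand-in for a "Synch" acknowledgment: report the VN in use.
        return self.active_vn


def switch_vn(nodes: List[VnNode], art_tables: Dict[str, Dict[str, str]],
              spare_vn: int) -> bool:
    # Operation 520: load the ART on the unused VN; PRT traffic is untouched.
    for n in nodes:
        n.tables[spare_vn] = dict(art_tables[n.name])
    # Operation 525: direct each node to begin using the ART.
    for n in nodes:
        n.active_vn = spare_vn
    # Operation 530: verify via acknowledgments that every node has switched.
    return all(n.sync() == spare_vn for n in nodes)
```

Because the ART is installed on a VN that carries no traffic, no node needs to stop issuing transactions while the tables are loaded, which is the property that distinguishes this embodiment from the quiescing approach.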
General Matters
Embodiments of the invention provide methods and systems for dynamic partitioning of MPSs. Alternative embodiments of the invention are applicable to MPSs having any number of agents and implementing two or more partitions.
Embodiments of the invention include methods having various operations, many of which are described in their most basic form, but operations can be added to or deleted from any of the methods without departing from the basic scope of the invention. The operations of various embodiments of the invention may be performed by hardware components or may be embodied in machine-executable instructions as described above. Alternatively, the operations may be performed by a combination of hardware and software. Embodiments of the invention may be provided as a computer program product that may include a machine-accessible medium having stored thereon instructions, which may be used to program a computer (or other electronic devices) to perform a process according to embodiments of the invention as described above.
A machine-accessible medium includes any mechanism that provides (i.e., stores and/or transmits) information in a form accessible by a machine (e.g., a computer, network device, personal digital assistant, manufacturing tool, any device with a set of one or more processors, etc.). For example, a machine-accessible medium includes recordable/non-recordable media (e.g., read only memory (ROM); random access memory (RAM); magnetic disk storage media; optical storage media; flash memory devices; etc.), as well as electrical, optical, acoustical or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.).
While the invention has been described in terms of several embodiments, those skilled in the art will recognize that the invention is not limited to the embodiments described, but can be practiced with modification and alteration within the spirit and scope of the appended claims. The description is thus to be regarded as illustrative instead of limiting.