1. Field of the Invention
The invention disclosed and claimed herein generally pertains to a method and related apparatus for data transfer between multiple root nodes and PCI adapters, through an input/output (I/O) switched-fabric bus. More particularly, the invention pertains to a method of the above type wherein different root nodes may be routed through the I/O fabric to share the same adapter, and a single control, used to configure the routing for all root nodes, resides in one of the nodes. Even more particularly, the invention pertains to a method of the above type wherein a challenge procedure is provided, to resolve any uncertainty as to which node is serving as the control node.
2. Description of the Related Art
As is well known by those of skill in the art, PCI Express (PCI-E) is widely used in computer systems to interconnect host units to adapters or other components, by means of an I/O switched-fabric bus or the like. However, PCI-E currently does not permit sharing of PCI adapters in topologies where there are multiple hosts with multiple shared PCI buses. As a result, even though such sharing capability could be very valuable when using blade clusters or other clustered servers, adapters for PCI-E and secondary networks (e.g., FC, IB, Enet) are at present generally integrated into individual blades and server systems. Thus, such adapters cannot be shared between clustered blades, or even between multiple roots within a clustered system.
In an environment containing multiple blades or blade clusters, it can be very costly to dedicate a PCI adapter for use with only a single blade. For example, a 10 Gigabit Ethernet (10 GigE) adapter currently costs on the order of $6,000. The inability to share these expensive adapters between blades has, in fact, contributed to the slow adoption rate of certain new network technologies such as 10 GigE. Moreover, there is a constraint imposed by the limited space available in blades to accommodate PCI adapters. This problem of limited space could be overcome if a PCI network were able to support attachment of multiple hosts to a single PCI adapter, so that virtual PCI I/O adapters could be shared between the multiple hosts.
In a distributed computer system comprising a multi-host environment or the like, the configuration of any portion of an I/O fabric that is shared between hosts, or other root nodes, cannot be controlled by multiple hosts. This is because one host might make changes that affect another host. Accordingly, to achieve the above goal of sharing a PCI adapter amongst different hosts, it is necessary to provide a central management mechanism of some type. This management mechanism is needed to configure the routings used by PCI switches of the I/O fabric, as well as by the root complexes, PCI adapters and other devices interconnected by the PCI switches.
It is to be understood that the term “root node” is used herein to generically describe an entity that may comprise a computer host CPU set or the like, and a root complex connected thereto. The host set could have one or multiple discrete CPUs. However, the term “root node” is not necessarily limited to host CPU sets. The term “root complex” is used herein to generically describe structure in a root node for connecting the root node and its host CPU set to the I/O fabric.
In one very useful approach, a particular designated root node includes a component which is the PCI Configuration Master (PCM) for the entire multi-host system. The PCM configures all routings through the I/O fabric, for all PCI switches, root complexes and adapters. However, in a PCI switched-fabric, multiple fabric managers are allowed. Moreover, any fabric manager can plug into any root switch port, that is, the port of a PCI switch that is directly connected to a root complex. As a result, when a PCM of the above type is engaged in configuring a route through a PCI fabric, it will sometimes encounter a switch that appears to be controlled by a fabric manager other than the PCM, residing at a root node other than the designated node. Accordingly, it is necessary to provide a challenge procedure, to determine or affirm which root node actually contains the controlling fabric configuration manager.
The invention generally provides a challenge procedure or protocol for determining the root node in which the PCI Configuration Master or Manager actually resides, in a multi-host system of the above type. This node is referred to as the master node. The challenge procedure is activated whenever the identity of the PCM, that is, of the root node containing the PCM, appears to be uncertain. The challenge procedure resolves this uncertainty, and enables the PCM to continue to configure routings throughout the system. In one useful embodiment, the invention is directed to a method for a distributed computer system provided with multiple root nodes, and further provided with one or more PCI switches and with adapters or other components that are available for sharing by different nodes. The method includes the steps of selecting a first one of the root nodes to be the master root node for the system, and operating the first root node to implement a procedure whereby the first root node queries the configuration space of a particular one of the PCI switches. The method further includes detecting information indicating that a second root node, rather than the first root node, is considered to be the master root node for the particular switch. A challenge procedure is implemented in response to this detected information, in an effort to confirm that the first root node is in fact the master root node for the system. The configuration space querying procedure is then continued if the first root node is confirmed to be the master root node. Otherwise, the querying procedure is aborted so that corrective action can be taken. Usefully, when the PCM is performing PCI configuration, all the root nodes are in a quiescent state. After the switched-fabric has been configured, the PCM writes the configuration information into the root switches, and then enables each of the root ports to access its configuration.
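The ordering described at the end of this summary (quiesce the root nodes, configure the fabric, write the per-node configuration, then enable the root ports) can be illustrated with a brief sketch. The sketch below is illustrative only, not the claimed implementation; the names RootNode, configure_fabric and routing_for are hypothetical placeholders for whatever mechanisms a particular system actually provides.

```python
# Illustrative sketch (not the claimed implementation): overall ordering of
# PCM-driven configuration as summarized above. All names are hypothetical.

class RootNode:
    def __init__(self, name, is_master=False):
        self.name = name
        self.is_master = is_master
        self.quiesced = False
        self.routing = None        # per-node routing info written by the PCM
        self.ports_enabled = False

def configure_system(root_nodes, configure_fabric, routing_for):
    """Run PCM configuration from the master root node.

    configure_fabric() performs the switch-by-switch configuration (including
    any challenge procedure) and returns True on success, False if aborted.
    routing_for(node) returns the routing information assigned to that node.
    """
    master = next(n for n in root_nodes if n.is_master)

    # 1. All root nodes are held quiescent while the PCM configures the fabric.
    for node in root_nodes:
        node.quiesced = True

    # 2. The PCM (in the master node) configures routings through the fabric.
    if not configure_fabric(master):
        return False        # configuration aborted, e.g. after a lost challenge

    # 3. The PCM writes each node's configuration into its root switch, then
    #    enables the root ports so each node can access its own routing.
    for node in root_nodes:
        node.routing = routing_for(node)
        node.ports_enabled = True
        node.quiesced = False
    return True

if __name__ == "__main__":
    nodes = [RootNode("RN-A", is_master=True), RootNode("RN-B")]
    ok = configure_system(nodes, configure_fabric=lambda m: True,
                          routing_for=lambda n: {"adapters": []})
    print("configured:", ok)
```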
The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further objectives and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:
The root complexes (RCs) 110, 120 and 130 are integral components of root nodes (RNs) 160, 162 and 164, respectively. There may be more than one RC in an RN; for example, RCs 140 and 142 are both integral components of RN 166. In addition to the RCs, each RN includes one or more Central Processing Units (CPUs) 102-104, 112-114, 122-124 and 132-134, memories 106, 116, 126 and 136, and memory controllers 108, 118, 128 and 138. The memory controllers respectively interconnect the CPUs, memory, and I/O RCs of their corresponding RNs, and perform such functions as handling the coherency traffic for respective memories.
RNs may be connected together at their memory controllers, such as by a link 146 extending between memory controllers 108 and 118 of RNs 160 and 162. This forms one coherency domain which may act as a single Symmetric Multi-Processing (SMP) system. Alternatively, nodes may be independent from one another with separate coherency domains, as in RNs 164 and 166.
It is to be understood that any one of the root nodes 160-166 could support the PCM. However, there must be only one PCM, to configure all routes and assign all resources, throughout the entire system 100. Clearly, significant uncertainties could develop if it appeared that there was more than one PCM in system 100, with each PCM residing in a different root node. Accordingly, embodiments of the invention are provided, first to determine that an uncertain condition regarding the PCM exists, and to then resolve the uncertainty.
In a very useful embodiment, a challenge protocol is operable to recognize that a PCI switch, included in the switched-fabric of the system, appears to be under the control of a PCM that is different from the PCM currently in control of the system. Upon recognizing this condition, the challenge protocol will either confirm that the current PCM has control over the switch, or else will abort configuration of the switch. This challenge protocol or procedure is described hereinafter in further detail.
Distributed computing system 100 may be implemented using various commercially available computer systems. For example, distributed computing system 100 may be implemented using an IBM eServer iSeries Model 840 system available from International Business Machines Corporation. Such a system may support logical partitioning using an OS/400 operating system, which is also available from International Business Machines Corporation.
Those of ordinary skill in the art will appreciate that the hardware depicted herein may vary, and that the depicted examples are not meant to imply architectural limitations with respect to the invention.
With reference to the logically partitioned platform 200 in which the invention may be implemented, the platform includes partitioned hardware 230, operating systems 202, 204, 206 and 208 running in partitions 212, 214, 216 and 218, and partition management firmware (hypervisor) 210.
Partitioned hardware 230 includes a plurality of processors 232-238, a plurality of system memory units 240-246, a plurality of input/output (I/O) adapters 248-262, NVRAM storage 298, and a storage unit 270. Partitioned hardware 230 also includes service processor 290, which may be used to provide various services, such as processing of errors in the partitions. Each of the processors 232-238, memory units 240-246, NVRAM 298, and I/O adapters 248-262 may be assigned to one of multiple partitions within logically partitioned platform 200, each of which corresponds to one of operating systems 202, 204, 206 and 208.
Partition management firmware (hypervisor) 210 performs a number of functions and services for partitions 212, 214, 216 and 218 to create and enforce the partitioning of logically partitioned platform 200. Hypervisor 210 is a firmware-implemented virtual machine identical to the underlying hardware. Hypervisor software is available from International Business Machines Corporation. Firmware is “software” stored in a memory chip that holds its content without electrical power, such as, for example, read-only memory (ROM), programmable ROM (PROM), electrically erasable programmable ROM (EEPROM), and non-volatile random access memory (NVRAM). Thus, hypervisor 210 allows the simultaneous execution of independent OS images 202, 204, 206 and 208 by virtualizing all the hardware resources of logically partitioned platform 200.
Operation of the different partitions may be controlled through a hardware management console, such as hardware management console 280. Hardware management console 280 is a separate distributed computing system from which a system administrator may perform various functions including reallocation of resources to different partitions.
In an environment of this type, resources or programs in one partition generally must be prevented from affecting the operations of any other partition.
Accordingly, some functionality is needed in the bridges that connect I/O adapters (IOAs) to the I/O bus, so that resources such as individual IOAs, or parts of IOAs, can be assigned to separate partitions while, at the same time, the assigned resources are prevented from affecting other partitions, such as by obtaining access to the resources of those other partitions.
Referring now to distributed computing system 300, the I/O switched-fabric of the system comprises a plurality of PCI switches, including multi-root aware switches 302 and 304, which interconnect the root complexes of the respective root nodes with a number of I/O adapters that may be shared among the root nodes.
Respective ports of a multi-root aware switch, such as switches 302 and 304, can be used as upstream ports, downstream ports, or both upstream and downstream ports. Generally, upstream ports are closer to the RC, downstream ports are further from the RC, and upstream/downstream ports can have the characteristics of both upstream and downstream ports.
The ports configured as downstream ports are to be attached or connected to adapters, or to the upstream port of another switch.
Each of the ports configured as an upstream port is used to connect to one of the root complexes 338-342.
The ports configured as upstream/downstream ports are used to connect to the upstream/downstream port of another switch.
I/O adapter 352 is shown as a virtualized I/O adapter, having its function 0 (F0) assigned and accessible to the system image SI 1, and its function 1 (F1) assigned and accessible to the system image SI 2. Similarly, I/O adapter 358 is shown as a virtualized I/O adapter, having its function 0 (F0) assigned and accessible to SI 3, its function 1 (F1) assigned and accessible to SI 4, and its function 3 (F3) assigned to SI 5. I/O adapter 366 is shown as a virtualized I/O adapter with its function F0 assigned and accessible to SI 2 and its function F1 assigned and accessible to SI 4. I/O adapter 368 is shown as a single-function I/O adapter assigned and accessible to SI 5.
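For illustration only, the function-to-system-image assignments just described can be viewed as a simple mapping. The structure below merely restates the assignments listed above; the identifier names are hypothetical.

```python
# Illustrative restatement of the adapter-function assignments described above.
# Keys are (adapter, function); values are the system images allowed to use them.
FUNCTION_ASSIGNMENTS = {
    ("adapter_352", "F0"): "SI 1",
    ("adapter_352", "F1"): "SI 2",
    ("adapter_358", "F0"): "SI 3",
    ("adapter_358", "F1"): "SI 4",
    ("adapter_358", "F3"): "SI 5",
    ("adapter_366", "F0"): "SI 2",
    ("adapter_366", "F1"): "SI 4",
    ("adapter_368", "F0"): "SI 5",   # single-function adapter
}

def images_for(adapter):
    """Return the system images that can access any function of an adapter."""
    return sorted({si for (a, _f), si in FUNCTION_ASSIGNMENTS.items() if a == adapter})

print(images_for("adapter_358"))   # ['SI 3', 'SI 4', 'SI 5']
```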
Referring now to the PCI configuration space that is provided for the PCI components of a system such as system 300, the configuration space includes a PCI-Express extended capabilities area 402.
In accordance with the invention, it has been recognized that the extended capabilities area 402 can be used to determine whether or not a PCI component is a multi-root aware PCI component. More particularly, the extended capabilities area 402 is provided with a multi-root aware bit 403. If the multi-root aware bit 403 is set for a PCI component, then the PCI component will support the multi-root PCI configuration described herein. Moreover, the extended capabilities area 402 also carries a PCM identifier (PCM ID) for the component, as discussed below.
It is to be understood that the PCM ID is a value that uniquely identifies the PCM, throughout a distributed computer system such as system 100 or 300. More particularly, the PCM ID clearly indicates the root node or CPU set in which the PCM component is located.
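A minimal sketch of how a component's multi-root awareness and PCM ID might be read is given below. The register layout is assumed for illustration: the bit position chosen for the multi-root aware bit 403, the offset chosen for the PCM ID field, and the read_config accessor are placeholders rather than values defined by the PCI-Express specification.

```python
# Hypothetical sketch: testing the multi-root aware bit and reading the PCM ID
# from a component's extended capabilities area. The bit position, field offset
# and the read_config() accessor are assumptions, not defined by PCI-Express.

MULTI_ROOT_AWARE_BIT = 1 << 0     # assumed position of multi-root aware bit 403

def is_multi_root_aware(read_config):
    """read_config(offset) -> 32-bit value from the extended capabilities area."""
    return bool(read_config(0x00) & MULTI_ROOT_AWARE_BIT)

def read_pcm_id(read_config):
    """Return the PCM ID stored in the (assumed) PCM ID field, or None."""
    if not is_multi_root_aware(read_config):
        return None               # component does not support multi-root config
    return read_config(0x04)      # assumed offset of the PCM ID field

# Example: a fake switch whose capabilities report multi-root awareness
# and a PCM ID of 0x9000.
fake_registers = {0x00: 0x1, 0x04: 0x9000}
print(hex(read_pcm_id(fake_registers.get)))   # 0x9000
```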
Referring now to the information that is maintained for the respective host CPU sets of system 300, an information space 502 contains, for each host CPU set, parameters that include a vital product data (VPD) ID and a user ID.
As is known by those of skill in the art, a unique VPD ID is assigned to a host CPU set when the unit is manufactured. Thus, respective host CPU sets of system 300 will have VPD ID values that are different from one another. It follows that to provide a unique value for PCM ID, the host CPU set having the highest VPD ID value could initially be selected to contain the PCM, and the PCM ID would be set to such highest VPD ID value. Alternatively, the host CPU set having the highest user ID, the highest user priority, or the highest value of a parameter not shown in information space 502 could be initially selected to contain the PCM component, and the PCM ID would be such highest value. The root node or host CPU unit initially designated to contain the PCM, and to thereby be the master root node for the system, could be selected by a system user, or could alternatively be selected automatically by a program.
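A brief sketch of this initial selection, assuming each host CPU set exposes its VPD ID as a comparable numeric value (the helper name and the sample values are hypothetical):

```python
# Illustrative sketch: the host CPU set with the highest VPD ID is initially
# chosen to contain the PCM, and the PCM ID is set to that VPD ID.

def select_initial_master(host_cpu_sets):
    """host_cpu_sets: iterable of (name, vpd_id) pairs with unique VPD IDs."""
    master_name, pcm_id = max(host_cpu_sets, key=lambda h: h[1])
    return master_name, pcm_id

hosts = [("host A", 1532), ("host B", 9901), ("host C", 4178)]
print(select_initial_master(hosts))   # ('host B', 9901)
```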
Referring further to the selection of the master root node, the PCM 370 resides in the host CPU set so selected, and that host CPU set thereby serves as the master root node for system 300.
An important function of the PCM 370, after respective routings have been configured, is to determine the state of each switch in the distributed computing system 300. This is usefully accomplished by operating the PCM to query the PCI configuration space, described above, of each such switch.
Referring now to the results of these queries, the information obtained from the respective switches is recorded by the PCM in a fabric table 602. The fabric table 602 thus indicates, for each switch of the I/O fabric, the configuration and routing information established by the PCM.
In systems such as those described above, it can happen that a switch queried by the PCM appears to be under the control of a fabric manager residing in a root node other than the master root node.
Referring now to the configuration procedure carried out by the PCM, the procedure is represented by a series of function and decision blocks, including function block 702 and decision blocks 704, 707 and 730. In broad terms, the PCM queries the configuration space of each switch of the I/O fabric in turn. Whenever a queried switch appears to be controlled by a PCM other than the current one, the challenge procedure described below is invoked, and configuration of that switch either continues or is aborted, depending on whether the challenge is won or lost.
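The flow represented by these blocks might be sketched as follows. This is a simplified reading of the procedure, not a reproduction of the flowchart; the switch records, the challenge callback and the fabric-table structure are hypothetical stand-ins.

```python
# Simplified sketch of the PCM's configuration loop described above.
# For each switch: query its configuration space; if it reports a PCM ID other
# than the current one, run the challenge procedure; continue only if the
# challenge is won, otherwise abort so corrective action can be taken.

def configure_fabric(switches, current_pcm_id, challenge, fabric_table):
    for sw in switches:
        reported = sw["pcm_id"]                 # from the switch's config space
        if reported is not None and reported != current_pcm_id:
            if not challenge(reported, current_pcm_id):
                return False                    # challenge lost: abort configuration
        sw["pcm_id"] = current_pcm_id           # claim the switch for the current PCM
        fabric_table[sw["name"]] = {"owner": current_pcm_id, "routes": []}
    return True                                 # all switches configured

# Example: the second switch appears to belong to another fabric manager.
table = {}
switches = [{"name": "switch_302", "pcm_id": None},
            {"name": "switch_304", "pcm_id": 0x1234}]
ok = configure_fabric(switches, current_pcm_id=0x9F00,
                      challenge=lambda other, mine: other < mine, fabric_table=table)
print(ok, table)
```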
When the fabric table 602 is completed, the PCM writes the configured routing information that pertains to a given one of the host CPU sets into the root complex of the given host set. This enables the given host set to access each PCI adapter assigned to it by the PCM, as indicated by the received routing information. However, the given host set does not receive configured routing information for any of the other host CPU sets. Accordingly, the given host is enabled to access only the PCI adapters assigned to it by the PCM.
Usefully, the configured routing information written into the root complex of a given host comprises a subset of a tree representing the physical components of distributed computing system 300. The subset indicates only the PCI switches, adapters and bridges that can be accessed by the given host CPU set.
As a further feature, only the host CPU set containing the PCM is able to issue write operations, or writes. The remaining host CPU sets are respectively modified, either to prevent them from issuing writes entirely, or to require them to use the PCM host set as a proxy for writes.
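A sketch of this distribution step is shown below, under the assumption that the completed fabric table records the host to which each adapter has been assigned; the field names and sample values are illustrative.

```python
# Illustrative sketch: after the fabric table is complete, each host's root
# complex receives only the subset of routing information for the adapters
# assigned to that host, so a host can reach only its own adapters.

def routing_subsets(fabric_table):
    """fabric_table: adapter name -> {"host": assigned host, "route": path info}.
    Returns, per host, the routes the PCM writes into that host's root complex."""
    subsets = {}
    for adapter, entry in fabric_table.items():
        subsets.setdefault(entry["host"], {})[adapter] = entry["route"]
    return subsets

fabric_table = {
    "adapter_352": {"host": "host A", "route": ["switch_302"]},
    "adapter_358": {"host": "host B", "route": ["switch_302", "switch_304"]},
    "adapter_368": {"host": "host A", "route": ["switch_304"]},
}
for host, routes in routing_subsets(fabric_table).items():
    print(host, "->", sorted(routes))
# host A -> ['adapter_352', 'adapter_368']
# host B -> ['adapter_358']
```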
Referring now to the challenge protocol itself, the protocol is entered when the PCM detects that a queried switch reports a PCM ID that differs from the current PCM ID.
Referring further to the protocol, the PCM ID reported by the switch, referred to herein as the challenge PCM ID, is compared with the current PCM ID, as indicated by decision block 810.
In the event that the challenge PCM ID is found to be equal to or greater than the current PCM ID, the challenge will be recorded as being lost, as indicated by function block 816. The protocol will then be exited, and the configuration procedure described above will be aborted so that corrective action can be taken.
Referring further to decision block 810, if the challenge PCM ID is instead found to be less than the current PCM ID, the challenge is recorded as being won. The current PCM is thereby confirmed as the master for the system, and the configuration procedure described above is allowed to continue.
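The comparison at decision block 810 reduces to a single ordering test. A minimal sketch, using the convention stated above (a challenge PCM ID equal to or greater than the current PCM ID means the challenge is lost):

```python
# Minimal sketch of the challenge resolution described above: the challenge is
# won only when the current PCM ID is strictly greater than the challenge PCM ID.

def resolve_challenge(current_pcm_id, challenge_pcm_id):
    """Return True if the current PCM wins the challenge and may keep configuring,
    False if the challenge is lost and configuration must be aborted."""
    return challenge_pcm_id < current_pcm_id

assert resolve_challenge(current_pcm_id=0x9F00, challenge_pcm_id=0x1234) is True
assert resolve_challenge(current_pcm_id=0x1234, challenge_pcm_id=0x9F00) is False
assert resolve_challenge(current_pcm_id=0x1234, challenge_pcm_id=0x1234) is False
```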
A data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
The computer program code may be accessible from a computer-usable or computer-readable storage medium for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer-readable medium can be any tangible apparatus that can contain and store the program for use by or in connection with the instruction execution system, apparatus, or device.
The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system. The medium may also be a physical or tangible medium on which computer-readable program code can be stored. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk, an optical disk, or some other physical storage device configured to hold computer-readable program code. Current examples of optical disks include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.
Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) can be coupled to the system either directly or through intervening I/O controllers.
Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems, remote printers, or storage devices through intervening private or public networks. Modems, cable modems and Ethernet cards are just a few of the currently available types of network adapters.
The description of the present invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiment was chosen and described in order to best explain the principles of the invention, the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.