The present disclosure relates to an input/output (IO) device that connects multiple servers to one or more network interfaces.
In an enterprise computing environment, host servers running one or more processes communicate with other devices in a network via individual input/output (IO) devices. In one example, the host servers connect to the IO devices in accordance with a computer expansion card standard, such as the Peripheral Component Interconnect Express (PCIe) standard.
Enterprise computing environments continue to grow in scale, complexity and connectivity. Virtualization technologies have been used in a number of manners to address such issues, but have not been fully exploited for use in IO devices.
Overview
An IO device is provided for connecting multiple servers to one or more network interfaces. The IO device includes a network connection module that comprises a plurality of network interfaces, and a virtual host interface configured to communicate with a plurality of host servers. The IO device also includes an input/output (IO) controller configured to connect each of the host servers to one or more of the network interfaces such that the connections between each host server and corresponding one or more network interfaces are operationally isolated and independent from one another.
Example Embodiments
IO device 100 also comprises a network connection module 18 that includes a plurality of network interfaces (not shown in the figure).
Device 100 includes a network connection module 18 comprising network control logic 48 and eight network interfaces 52(1)-52(8), each providing a corresponding communication link 50(1)-50(8). In one form, the network interfaces are 10 Gigabit Serial Electrical Interfaces (XFI interfaces). These interfaces support a 10 Gigabit Ethernet (GE) port channel, or 40 GE when bundled into groups of four interfaces. Each interface 52(1)-52(8) may also support Serial Gigabit Media Independent Interface (SGMII) transfer at 1 GE speed. The number of interfaces and communication links may depend on, for example, the number of host servers 20, the selected configurations, the networks used, etc. Additionally, the networks 54(1)-54(8) may be the same or different networks, again depending on the configurations selected by host servers 20.
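As an informal illustration only, the following Python sketch models the interface arrangement described above; the names (NetworkInterface, bundle_40ge) are hypothetical and not part of the disclosure, and the sketch simply shows eight 10 GE interfaces that may be grouped into 40 GE bundles of four.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class NetworkInterface:
    """One XFI interface 52(x); speed_gbps is 10 for 10 GE or 1 for SGMII."""
    index: int
    speed_gbps: int = 10

def bundle_40ge(interfaces: List[NetworkInterface]) -> List[List[NetworkInterface]]:
    """Group interfaces into bundles of four, each bundle offering 40 GE."""
    if len(interfaces) % 4 != 0:
        raise ValueError("40 GE bundles require groups of four interfaces")
    return [interfaces[i:i + 4] for i in range(0, len(interfaces), 4)]

# Eight interfaces 52(1)-52(8), as in the described form of device 100.
ports = [NetworkInterface(index=i) for i in range(1, 9)]
for n, bundle in enumerate(bundle_40ge(ports), start=1):
    total = sum(p.speed_gbps for p in bundle)
    print(f"bundle {n}: interfaces {[p.index for p in bundle]} -> {total} GE")
```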
Device 100 may also include a separate SGMII port 38 configured to connect to Baseboard Management Controller (BMC) interfaces of respective host servers 20(1)-20(8). Port 38 may also support Network Controller Sideband Interface (NCSI) transfer. Additionally, device 100 may include memory 39 in the form of double data rate type three synchronous dynamic random access memory (DDR3 SDRAM) having a high bandwidth interface (e.g., 4 GB maximum capacity) that may be used for, as examples, burst packet buffering, management protocols, PCIe configuration and virtualization structures, exchange table management, flow tables, and other control structures. Device 100 may also include other interfaces, such as a 16-bit parallel flash interface, a Serial Peripheral Interface (SPI), a two-wire (I2C) interface, a universal asynchronous receiver/transmitter (UART), a Management Data IO (MDIO) interface, a General Purpose IO (GPIO) interface, and/or a Joint Test Action Group (JTAG) interface. Such interfaces are options for different forms of device 100 and, for ease of illustration, have not been included in the figure.
IO device 100 may operate with host servers 20(1)-20(8) having a number of different configurations.
For ease of illustration, the implementation details of the remaining seven host servers 20(2)-20(8) have been omitted. However, it would be appreciated that host servers 20(2)-20(8) may be the same as described above with reference to server 20(1) or may have a different implementation.
The communication links 12(1)-12(8) between host servers 20(1)-20(8) and device 100 operate in accordance with the Peripheral Component Interconnect Express (PCIe) standard and are virtualized at virtual host interface 14. At the physical level, a PCIe link comprises one or more lanes, each composed of a transmit pair and a receive pair of differential lines, i.e., four wires or signal paths configured to transport data packets between the endpoints of the link. A link may include one to thirty-two lanes, in powers of two (1, 2, 4, 8, 16 and 32).
Virtual host interface 14 includes a number of vNICs 32(1), 32(2), etc. As described below, each vNIC is independently allocated to one of the PCIe ports 30(1)-30(8), and thus to one of the host servers 20(1)-20(8), by an IO controller 16 of IO device 100. Each port 30(1)-30(8) may include a plurality of vNICs but, for ease of illustration, only two vNICs, 32(1) and 32(2), are shown in port 30(1). Additionally, each virtual port 30(1)-30(8) includes its own clock domain 36 that is driven by a clock associated with that port, and thus is independent from the system clock of device 100. Each port 30(1)-30(8) also has its own reset domain 34 that is isolated from the reset domains of the other ports and from the central reset domain of device 100. This clock and reset isolation is represented in the figure.
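The per-port isolation described above can be pictured with a small data-structure sketch, shown below in Python; the class and field names (VirtualPort, Vnic, in_reset, etc.) are hypothetical stand-ins for the reference numerals in the text, and the sketch only illustrates that resetting one virtual port leaves the other ports untouched.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Vnic:
    """A vNIC 32(x) allocated to exactly one virtual PCIe port."""
    vnic_id: int
    owner_port: int

@dataclass
class VirtualPort:
    """A virtual PCIe port 30(x) with its own clock and reset domains."""
    port_id: int
    clock_domain: str = ""          # driven by the clock of this port's link
    in_reset: bool = False          # reset domain 34, isolated per port
    vnics: List[Vnic] = field(default_factory=list)

    def reset(self) -> None:
        """Resetting this port does not touch any other VirtualPort."""
        self.in_reset = True
        self.vnics.clear()

# Eight virtual ports, one per host server 20(1)-20(8).
ports: Dict[int, VirtualPort] = {
    i: VirtualPort(port_id=i, clock_domain=f"clk_port_{i}") for i in range(1, 9)
}
ports[1].vnics.append(Vnic(vnic_id=1, owner_port=1))
ports[1].reset()                                  # only port 30(1) is affected
print(ports[1].in_reset, ports[2].in_reset)       # True False
```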
Each server 20(1)-20(8) is connected to one or more network interfaces 52(1)-52(8) in network connection module 18 such that data received by device 100 from one of the host servers 20(1)-20(8) is passed through to the interfaces. In addition to virtual host interface 14, device 100 includes a number of other hardware and software elements that facilitate the connection of servers 20 to interfaces 52. These elements are collectively shown in the figure.
The IO controller 16 comprises a processor 44, a scheduler 43, and memory 42 that stores software executable by the processor 44 for performing various control functions in the IO device 100. Scheduler 43 is a dedicated piece of hardware that is configured by processor 44.
When a host server 20 connects to device 100, it observes a private IO subsystem (private PCIe tree with multiple vNICs) that it interprets as being configurable to its own specifications. That is, when connected, each host server 20(1)-20(8) is not aware that it is sharing a common device with the other host servers, and the host server is permitted to determine what connections it desires with one or more interfaces 52(1)-52(8). This virtualization layer allows IO device 100 to present heterogeneous vNIC configurations and addressing to each of the host servers 20(1)-20(8) as required by each host server's Basic IO System (BIOS). As such, IO controller 16 receives host-selected configuration data, referred to as PCIe transactions, from each host server 20(1)-20(8). IO controller 16 responds to the PCIe transactions as needed, and uses the transactions to configure the virtual PCIe topology or space for a given one of the host servers 20(1)-20(8).
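A minimal sketch, assuming a simplified model of PCIe configuration writes, of how such host-selected configuration data might be recorded per host server is given below in Python; the names (ConfigWrite, VirtualTopologyDB) are hypothetical, and the structure merely illustrates that each host's transactions are kept in a private namespace.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass(frozen=True)
class ConfigWrite:
    """A simplified PCIe configuration transaction from one host server."""
    host_id: int
    bdf: Tuple[int, int, int]   # (bus, device, function) in the host's private tree
    register: int               # config-space offset
    value: int

class VirtualTopologyDB:
    """Per-host record of the private PCIe topology each server has configured."""
    def __init__(self) -> None:
        self._db: Dict[int, Dict[Tuple[int, int, int], Dict[int, int]]] = defaultdict(
            lambda: defaultdict(dict)
        )

    def apply(self, txn: ConfigWrite) -> None:
        # Each host's writes land in its own namespace; hosts never see each other.
        self._db[txn.host_id][txn.bdf][txn.register] = txn.value

    def topology_of(self, host_id: int):
        return dict(self._db[host_id])

db = VirtualTopologyDB()
db.apply(ConfigWrite(host_id=1, bdf=(0, 0, 0), register=0x10, value=0xF0000000))
db.apply(ConfigWrite(host_id=2, bdf=(0, 0, 0), register=0x10, value=0xF0000000))
print(db.topology_of(1))   # host 2 wrote the same BAR value, yet nothing collides
```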
Because the configurations selected by each host server 20(1)-20(8) do not account for the configurations requested by the other servers, there may be colliding information (e.g., addresses). Instead of notifying host servers 20(1)-20(8) of such collisions, processor 44, as it builds the virtual PCIe topology for a given host server, also maps that topology to the transmit and receive resource instances of transmit and receive module 40. For example, in one form a base address register (BAR) describes the address of a transmit and receive resource instance with respect to the private PCIe topology of host server 20(1). However, because this address is private to host server 20(1), processor 44 maps or correlates the BAR address to an address that uniquely identifies the transmit and receive resource instance. This ensures that each transmit and receive resource instance is mapped to host server 20(1) and does not overlap with instances mapped to the other host servers 20(2)-20(8). Once completed, the mapped configuration is maintained by the transmit and receive resource instances, thereby allowing the virtual devices to operate at full speed.
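The BAR-to-instance correlation described above can be sketched as follows in Python; the name ResourceMapper and the simple pool of instance identifiers are assumptions made for illustration, and the sketch shows that two hosts choosing the same private BAR address still receive distinct instances.

```python
from typing import Dict, Tuple

class ResourceMapper:
    """Maps (host_id, private BAR address) to a globally unique resource instance.

    Hypothetical sketch: integer instance ids stand in for the transmit and
    receive resource instances of module 40.
    """
    def __init__(self, num_instances: int) -> None:
        self._free = list(range(num_instances))
        self._map: Dict[Tuple[int, int], int] = {}

    def map_bar(self, host_id: int, bar_addr: int) -> int:
        if not self._free:
            raise RuntimeError("no free transmit/receive resource instances")
        instance = self._free.pop(0)
        self._map[(host_id, bar_addr)] = instance
        return instance

    def resolve(self, host_id: int, bar_addr: int) -> int:
        # A host can only reach instances that were mapped for it.
        return self._map[(host_id, bar_addr)]

mapper = ResourceMapper(num_instances=16)
# Hosts 1 and 2 both picked BAR address 0xF0000000 in their private trees.
a = mapper.map_bar(host_id=1, bar_addr=0xF0000000)
b = mapper.map_bar(host_id=2, bar_addr=0xF0000000)
print(a, b)            # distinct instances despite identical private addresses
```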
The above PCIe topology generation and associated mapping is performed for all host servers 20(1)-20(8) connected to device 100. Furthermore, in operation, host servers 20(1)-20(8) are prevented from addressing transmit and receive resource instances that have not been mapped to them. Additionally, because IO device 100 maintains a one-to-one mapping of instances to servers, resource instances mapped to a particular host server may not access memory or other resources associated with other host servers.
The integration of IO operations into a single device provides advantages in scheduling. In one form, host servers 20(1)-20(8) will compete for limited resources, such as bandwidth. However, because IO device 100 is a central location for all IO transactions, the device can schedule bandwidth among host servers 20(1)-20(8). As such, the scheduler 43 enforces service levels according to configured policies. The scheduler 43 has visibility across all vNICs and queues in the system, allowing priority groups, rate limiting, and CIR (Committed Information Rate) to be scheduled across vNICs as well as across host servers 20(1)-20(8). Therefore, bandwidth and resources may be parceled out among host servers 20(1)-20(8) according to any preselected policy, allowing for “universal” scheduling. This traffic scheduling may be performed for egress or ingress data traffic.
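One possible, simplified reading of such a policy is sketched below in Python: each vNIC is first granted its committed rate and the remaining uplink bandwidth is then split by weight. The names (VnicPolicy, allocate) and the allocation rule are hypothetical illustrations, not the scheduler 43 implementation.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class VnicPolicy:
    """Scheduling policy for one vNIC: committed rate plus a sharing weight."""
    name: str
    host_id: int
    cir_gbps: float     # committed information rate
    weight: int         # share of the excess bandwidth

def allocate(uplink_gbps: float, policies: List[VnicPolicy]) -> Dict[str, float]:
    """Grant each vNIC its CIR, then split the remaining uplink by weight."""
    committed = sum(p.cir_gbps for p in policies)
    if committed > uplink_gbps:
        raise ValueError("committed rates exceed uplink capacity")
    excess = uplink_gbps - committed
    total_weight = sum(p.weight for p in policies) or 1
    return {p.name: p.cir_gbps + excess * p.weight / total_weight for p in policies}

grants = allocate(
    uplink_gbps=40.0,
    policies=[
        VnicPolicy("host1-vnic0", host_id=1, cir_gbps=5.0, weight=2),
        VnicPolicy("host2-vnic0", host_id=2, cir_gbps=5.0, weight=1),
        VnicPolicy("host3-vnic0", host_id=3, cir_gbps=0.0, weight=1),
    ],
)
print(grants)   # {'host1-vnic0': 20.0, 'host2-vnic0': 12.5, 'host3-vnic0': 7.5}
```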
Large bursts of Ethernet traffic targeting a single host server 20 are an area of concern. As noted below, in one form, device 100 includes the ability to buffer some level of Ethernet traffic. However, due to the virtual topology created within device 100, any single virtual PCIe device, or group of such devices, can utilize the full uplink bandwidth, can be rate limited to a target bandwidth, or can share bandwidth according to a policy.
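Rate limiting a single virtual device to a target bandwidth could, for example, be modeled with a token bucket, as in the hedged Python sketch below; the TokenBucket class and its parameters are illustrative assumptions rather than a description of device 100.

```python
class TokenBucket:
    """Rate-limits one virtual PCIe device to a target bandwidth (sketch only)."""
    def __init__(self, rate_bytes_per_s: float, burst_bytes: float) -> None:
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.last_ts = 0.0

    def allow(self, packet_bytes: int, now_s: float) -> bool:
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.capacity,
                          self.tokens + (now_s - self.last_ts) * self.rate)
        self.last_ts = now_s
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True
        return False        # excess traffic is buffered or dropped per policy

# Limit a vNIC to 1 GB/s with a 1 MB burst allowance.
limiter = TokenBucket(rate_bytes_per_s=1e9, burst_bytes=1e6)
print(limiter.allow(500_000, now_s=0.0))      # True, within the burst
print(limiter.allow(900_000, now_s=0.0001))   # False until enough tokens refill
```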
As previously noted, each virtual port 30(1)-30(8) includes its own clock domain 36 that is independent from the system clock of device 100. Each virtual port 30(1)-30(8) also includes its own reset domain 34 that is isolated from the reset domains of other ports and from the central reset domain of device 100. Due to these private clock and reset domains, the vNICs 32(1)-32(N) for each of the links 12(1)-12(8) are isolated from one another and, as such, the links and host servers 20(1)-20(8) are operationally isolated and independent from one another. This isolation ensures that the operation of one host server does not affect the operation of other host servers. That is, a host server may reboot, enumerate PCIe, power cycle, or be removed from device 100 at any time without disturbing the operation of other attached host servers. A surprise hot plug event, for example, will terminate any pending transaction to that host server with error response completions back to resources of IO device 100. All internal direct memory access (DMA) engines track error state on a per-queue and per-vNIC basis, so individual vNICs assigned to removed host servers will experience fatal error conditions and report them, while other vNICs operate continuously without error.
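A small Python sketch of per-vNIC error tracking on a surprise removal is shown below; the names (DmaErrorTracker, VnicErrorState) are hypothetical, and the sketch only illustrates that a removal event marks fatal error state on the vNICs owned by the removed host while leaving other vNICs untouched.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

@dataclass
class VnicErrorState:
    """Per-vNIC error tracking kept by a (hypothetical) DMA engine model."""
    fatal: bool = False
    errored_queues: Set[int] = field(default_factory=set)

class DmaErrorTracker:
    def __init__(self) -> None:
        self.state: Dict[int, VnicErrorState] = {}
        self.owner: Dict[int, int] = {}      # vnic_id -> host_id

    def assign(self, vnic_id: int, host_id: int) -> None:
        self.state[vnic_id] = VnicErrorState()
        self.owner[vnic_id] = host_id

    def surprise_removal(self, host_id: int) -> None:
        """Fault only the vNICs mapped to the removed host; others keep running."""
        for vnic_id, owner in self.owner.items():
            if owner == host_id:
                self.state[vnic_id].fatal = True

tracker = DmaErrorTracker()
tracker.assign(vnic_id=10, host_id=1)
tracker.assign(vnic_id=11, host_id=2)
tracker.surprise_removal(host_id=1)
print(tracker.state[10].fatal, tracker.state[11].fatal)   # True False
```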
In summary, because each host server has its own PCIe tree and no knowledge of other host servers, each host server can select its own desired transmit/receive configuration (private interrupt mapping and assignment space and private ownership of its own devices). In other words, the host server boots, configures and uses its devices, and there is no change to the host server control model and no change in the drivers. Therefore, there is no need for extensions such as Multi-Root IO Virtualization (MR-IOV) or Single-Root IO Virtualization (SR-IOV), although support for SR-IOV may be provided. Additionally, each host server cannot disturb its peers, either maliciously or accidentally. Host servers can be removed or rebooted at any time without affecting one another, and can re-enumerate their PCIe topology at any time.
After host server 20(1) is reset or turned on, the server's BIOS or OS probes its attached PCIe bus via PCIe configuration transactions that define the PCIe topology desired by server 20(1). Method 300 begins at 310, wherein the PCIe configuration transactions are received from host server 20(1). More specifically, the transactions are received by processor 44 in IO controller 16. Processor 44 responds to the configuration transactions, as needed, and maintains a database of the desired PCIe topology and device type associated with server 20(1). This database also includes the desired PCIe topologies of the other servers 20(2)-20(8) and their device types.
Method 300 continues at 320, where processor 44 generates a virtual PCIe topology for host server 20(1) to communicate with network interfaces 52(1)-52(8). Connection between host server 20(1) and network interfaces 52(1)-52(8) is provided via virtual host interface 14 and transmit and receive module 40. As previously noted, transmit and receive module 40 includes multiple instances of transmit and receive resources. At 330, processor 44 maps the generated virtual topology to instances of the transmit and receive resources. The connections between host server 20(1) and network interfaces 52(1)-52(8) are operationally isolated and independent from the connections of the other servers 20(2)-20(8). As noted above, in one form processor 44 ensures that each transmit and receive resource instance is mapped to host server 20(1) and that there is no overlap of the instances with other host servers. Once completed, the mapped configuration is maintained by the transmit and receive resource instances.
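The three operations 310, 320 and 330 can be summarized in a short Python sketch, given below with canned stand-in data; the function names and the simplified register/value representation are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

def receive_config_transactions(host_id: int) -> List[Tuple[int, int]]:
    """Operation 310: collect (register, value) configuration writes from a host.
    Canned values stand in for real PCIe transactions."""
    return [(0x10, 0xF0000000), (0x14, 0x00000000)]

def generate_virtual_topology(host_id: int,
                              txns: List[Tuple[int, int]]) -> Dict[int, int]:
    """Operation 320: build the host's private view of its PCIe devices."""
    return {register: value for register, value in txns}

def map_to_resources(host_id: int, topology: Dict[int, int],
                     free_instances: List[int]) -> Dict[int, int]:
    """Operation 330: bind each configured BAR to a unique resource instance."""
    return {register: free_instances.pop(0) for register in topology}

free = list(range(32))                      # transmit/receive resource instances
for host in range(1, 9):                    # repeat for all eight host servers
    txns = receive_config_transactions(host)
    topo = generate_virtual_topology(host, txns)
    mapping = map_to_resources(host, topo, free)
print(len(free))                            # instances left after mapping all hosts
```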
As noted above, due to the independence and isolation of the host servers 20(1)-20(8) and the paths to interfaces 52(1)-52(8), powering off or suddenly resetting an individual server does not impact the operation of other servers attached to device 100. If a server 20, such as server 20(1), is powered off or reset suddenly, processor 44 clears the mapping performed at 330 and frees the resources associated with the server. When server 20(1) is powered on again, operations 310-330 may be repeated.
A root complex (RC) 70 allows one or more physical PCIe devices to be attached to device 100. An attached PCIe device may be controlled by processors, such as processor 44, in IO device 100. That is, device 100 controls the PCIe endpoint devices attached to RC 70, thereby allowing device 100 to run the physical driver of, for example, SR-IOV devices. This control further allows each function of such a device to be mapped to an individual host, which in turn runs the native driver of the mapped function. This allows third party PCIe devices to be integrated with other virtual devices in a way that does not need to be exposed to host servers 20(1)-20(8). Example implementations of RC 70 are provided below.
Each RC port 80(1) and 80(2) has a private PCIe space that is enumerated by drivers running on processor 44, and is assigned local BARs by processor 44. In one form, processor 44 may maintain total control over its local devices, running its own Linux drivers.
In another form, processor 44 may map partial or entire functions or devices to the attached host servers 20(1)-20(8). This is especially useful for SR-IOV capable devices, which often support 16 functions plus a physical device driver. An example of one SR-IOV device, in the form of an SR-IOV storage system 82, is shown attached to RC port 80(2).
When SR-IOV storage system 82 is attached, processor 44 will run the physical system driver locally and will map individual functions to individual host servers 20(1)-20(8). Because device 100 has virtualized the PCIe topology, device 100 can translate IO operations between topologies without support from the device drivers. As such, the functions of system 82 may be separated and the individual functions may be added to one or more PCIe server topologies.
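A hedged sketch of such function mapping is given below in Python, assuming an SR-IOV device whose physical function is retained locally while a number of virtual functions are handed out one per host; the SriovDeviceMap class and its methods are hypothetical.

```python
from typing import List, Optional

class SriovDeviceMap:
    """Sketch: the physical function stays with the local processor; each
    virtual function (VF) is handed to at most one host server."""
    def __init__(self, num_vfs: int) -> None:
        self.physical_function_owner = "local processor (physical driver)"
        self.vf_owner: List[Optional[int]] = [None] * num_vfs

    def map_vf(self, vf_index: int, host_id: int) -> None:
        if self.vf_owner[vf_index] is not None:
            raise ValueError(f"VF {vf_index} is already mapped")
        self.vf_owner[vf_index] = host_id

    def functions_visible_to(self, host_id: int) -> List[int]:
        """A host sees only the VFs mapped into its private PCIe topology."""
        return [i for i, owner in enumerate(self.vf_owner) if owner == host_id]

storage = SriovDeviceMap(num_vfs=16)
storage.map_vf(0, host_id=1)
storage.map_vf(1, host_id=2)
print(storage.functions_visible_to(1))   # [0]
print(storage.functions_visible_to(2))   # [1]
```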
Device 100 will maintain the address and configuration space mapping such that each attached host server 20(1)-20(8) sees only the function(s) mapped to its local PCIe topology. Each host server enumerates the mapped function using its BIOS and assigns BARs in its local PCIe address space. This allows each host server 20(1)-20(8) to run the native function driver, completely isolated from its neighboring host servers.
An individual host server may reboot and re-enumerate its virtual PCIe topology without disturbing operation of other attached host servers. In this event, processor 44 will issue a function reset to the functions of storage system 82 mapped to that host server. Logic within RC 70 includes a table that maps PCIe bus/device/function (BDF) numbers to internal vNICs of device 100, which are in turn assigned to host virtual switch BDFs as transactions travel upstream.
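A minimal sketch of such a translation table is shown below in Python; the names (BdfTranslationTable, upstream) and the tuple representation of bus/device/function numbers are assumptions made for illustration.

```python
from typing import Dict, Tuple

BDF = Tuple[int, int, int]   # (bus, device, function)

class BdfTranslationTable:
    """Sketch of the RC-side table: local device BDF -> internal vNIC -> host BDF."""
    def __init__(self) -> None:
        self.local_to_vnic: Dict[BDF, int] = {}
        self.vnic_to_host_bdf: Dict[int, Tuple[int, BDF]] = {}

    def add(self, local_bdf: BDF, vnic_id: int, host_id: int, host_bdf: BDF) -> None:
        self.local_to_vnic[local_bdf] = vnic_id
        self.vnic_to_host_bdf[vnic_id] = (host_id, host_bdf)

    def upstream(self, local_bdf: BDF) -> Tuple[int, BDF]:
        """Translate a transaction travelling upstream toward its host server."""
        vnic_id = self.local_to_vnic[local_bdf]
        return self.vnic_to_host_bdf[vnic_id]

table = BdfTranslationTable()
# VF 3 of the attached storage system, mapped to host 2 as bus 0, device 4, fn 0.
table.add(local_bdf=(1, 0, 3), vnic_id=7, host_id=2, host_bdf=(0, 4, 0))
print(table.upstream((1, 0, 3)))   # (2, (0, 4, 0))
```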
In another form, the local memory resources of device 100 can also be used as a resource to virtualize standard devices. In these cases, processor 44 handles the driver translation tasks. More specifically, an attached PCIe device may not be a sharable device. In this case, processor 44 may take control of the physical device and function as a proxy between the physical PCIe device and a host server. After processor 44 takes control of the PCIe device, it presents the device's functions to a host server. When a device request is made by a host server 20(1)-20(8), the request is proxied through processor 44. In this way, a layer of control software, similar to a hypervisor, is utilized. The proxy process ensures that host servers 20(1)-20(8) will not collide when requesting physical device services. Therefore, if an attached PCIe device is not sharable, the processor functions as a proxy for its functions between the device and a host server 20(1)-20(8) that uses the functions.
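A simple Python sketch of the proxy arrangement is given below, assuming requests from the host servers are queued and serviced one at a time by the local processor; the names (DeviceProxy, DeviceRequest) are hypothetical, and the printout merely stands in for issuing operations to the physical device.

```python
import queue
import threading
from dataclasses import dataclass

@dataclass
class DeviceRequest:
    host_id: int
    operation: str

class DeviceProxy:
    """Sketch: the local processor as a proxy that serializes requests from all
    hosts to a single non-sharable physical PCIe device."""
    def __init__(self) -> None:
        self._requests: "queue.Queue[DeviceRequest]" = queue.Queue()
        self._lock = threading.Lock()      # only one request touches the device

    def submit(self, req: DeviceRequest) -> None:
        self._requests.put(req)

    def service_all(self) -> None:
        while not self._requests.empty():
            req = self._requests.get()
            with self._lock:
                # Stand-in for issuing the operation to the physical device.
                print(f"host {req.host_id}: {req.operation} -> done")

proxy = DeviceProxy()
proxy.submit(DeviceRequest(host_id=1, operation="read block 0"))
proxy.submit(DeviceRequest(host_id=3, operation="write block 7"))
proxy.service_all()
```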
As noted throughout the above description, the various forms of IO device 100 provide a number of features and advantages. For example, in one form, IO device 100 may support multiple independent host servers, greatly reducing the cost and power of a server cluster. This server aggregation architecture reduces network latency, allows each individual server to burst data at the full cluster uplink bandwidth, can absorb large bursts to a single host, and provides all servers with centralized management and network services not available from traditional network interface cards, thereby allowing consolidated policies to be applied across groups or classes of devices. Additionally, each host server interface is fully virtualized and isolated from the interfaces of other host servers and, accordingly, supports hot plug. In another form, failover operations between two virtual or real devices connected to one or two IO devices are provided. This is possible because IO device 100 has completely virtualized the PCIe topology and can take over or re-direct device interface commands and responses from a host server at any time.
Aspects of device 100 have been described with reference to a single processor 44. It would be appreciated that the use of one processor is merely illustrative, and more than one processor may be used for any of the above operations. For example, in one form, device 100 includes five identical or different processors.
The above description is intended by way of example only.