A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever.
The present disclosure relates generally and without limitation to the field of data buses, devices, interconnects such as e.g., fabrics, and networking and specifically, in one or more exemplary embodiments, to methods and apparatus for providing interconnection and data routing within fabrics comprising one or more host devices or processes and one or more destination devices or processes, including for migration of processes such as e.g., virtual machines (VMs).
The presentation of virtualized device interfaces, such as e.g., the virtualization of a configuration space of a PCI family device (such as PCI, PCI-X, PCIe, CardBus, CXL, etc.) is known. The concept of virtualized device interfaces can further be generalized beyond PCI-family devices to any other device protocol which supports an ID-based register access space (aka the “configuration space” in the PCIe context).
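By way of purely illustrative example, the following Python sketch models such an ID-based register access space as a minimal virtualized configuration space. The register offsets shown (Vendor ID at 0x00, Device ID at 0x02) follow the standard PCI header layout; the class and method names are hypothetical and not drawn from any specification or particular implementation.

```python
# Purely illustrative model of an ID-based register access space
# (a "configuration space" in the PCIe sense). Offsets follow the
# standard PCI configuration header; all names are hypothetical.

class VirtualConfigSpace:
    """One host's virtualized view of a device's configuration space."""

    def __init__(self, vendor_id: int, device_id: int):
        self.regs = bytearray(256)  # legacy PCI config space size
        self.regs[0x00:0x02] = vendor_id.to_bytes(2, "little")
        self.regs[0x02:0x04] = device_id.to_bytes(2, "little")

    def read(self, offset: int, size: int) -> int:
        """ID-routed configuration read (e.g., issued during enumeration)."""
        return int.from_bytes(self.regs[offset:offset + size], "little")

    def write(self, offset: int, size: int, value: int) -> None:
        """ID-routed configuration write (e.g., a Command/BAR update)."""
        self.regs[offset:offset + size] = value.to_bytes(size, "little")

# A host enumerating this virtual interface reads back the IDs it was given.
vdi = VirtualConfigSpace(vendor_id=0x1234, device_id=0xABCD)
assert vdi.read(0x00, 2) == 0x1234
```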
Recent advances in, inter alia, switch technology allow for more complex transaction routing in switch designs. These advances allow for moving devices from the purview of one host to that of another host (or to a Virtual Machine (VM) on those hosts).
However, extant configurations suffer from several disabilities, including for instance that they do not allow for the hot-migration of ownership from one host to another, whether as base hosts, or as hosts featuring hypervisor systems that support Virtual Machines onto which the Virtual Machine load will be migrated.
Accordingly, improved apparatus, methods, and systems are needed to address the foregoing, including providing improved efficiency and reduced latency for such VM migrations between originating and destination virtual machine device instantiations. More specifically, such apparatus, methods, and systems could, inter alia, substantially reduce the time that I/O traffic must be halted during migration and minimize virtual machine (VM) downtime in such migrations, thereby reducing latency.
The present disclosure addresses the foregoing deficiencies by providing, inter alia, apparatus, systems, methods, and computer-readable apparatus for supporting, inter alia, efficient and effective process migration and pausing of data communications within a fabric or other networked system.
In one aspect of the disclosure, methods and apparatus for migration of one or more processes within a data processing system are disclosed. In one embodiment, an originating function, process or entity (e.g., a Virtual Device Instantiation, or VDI) exists during migration, and during migration a destination function, process or entity (e.g., a second Virtual Device Instantiation) exists at the same time as the originating VDI for at least a period of time; i.e., there is simultaneous operation (at least in part) of the Source VDI and the Destination VDI.
In one implementation, the co-existence during migration comprises one or more aspects of the destination function, entity or process (e.g., VDI) being detectable or accessible by another destination-side entity (e.g., a destination Physical Host). This detectability or accessibility may include, without limitation, the ability of the respective host to send or receive protocol transactions to/from the VDI, such as reading or writing registers, sending or receiving interrupts, accepting upstream message types for detection or error reporting, or experiencing any detectable side-effect of the destination Virtual Device Instantiation at a destination Physical Host (PH), while at the same time allowing one or more functions or processes to be performed on the originating Virtual Device Instantiation, such as by the originating Physical Host (including any originating VM processes).
In another aspect of the disclosure, methods and apparatus for the reduction of delay or latency within e.g., a migration operation, are disclosed. In one embodiment, simultaneous operation of source and destination functions, processes or entities (e.g., VDIs) allows the source host (and VM) to continue operations to the I/O device while the destination host undergoes migration setup operations. This significantly reduces the time where I/O traffic must be blocked from ongoing operations during the migration.
In a further aspect of the disclosure, methods and apparatus for pausing bus traffic are disclosed. In one embodiment, the pausing is implemented during a VM migration process, and I/O traffic (bus protocol traffic) (i) from the physical endpoint device function to the originating host (or destination host), and/or (ii) from the originating host or destination host to the physical endpoint device function, is paused. In one implementation, the pausing occurs inside the I/O traffic switch routing fabric itself (not in the physical endpoint device function) and not in the physical host functions, such that the physical endpoint device function's and the physical host system's only awareness of such a pause occurs if/when the fabric's flow control mechanisms are invoked to prevent the endpoint device from transmitting additional I/O (i.e., such entities merely detect a protocol flow control stop (backpressure) condition).
In one variant of the foregoing, I/O traffic routing is then changed within the I/O routing fabric (during VM migration), and the fabric I/O routing operations that were previously paused are then unpaused or otherwise permitted to proceed; such traffic is then unblocked and completed at e.g., the destination Physical Host and real endpoint function.
In yet a further aspect of the disclosure, methods and apparatus configured to provide enhanced pausing and blocking behavior are described. In one embodiment, such methods and apparatus implement a specific process sequence that allows I/O traffic to be blocked (e.g., using a traffic routing block/pause function that causes a flow control condition) where the originating host and destination host may have different bus IDs (fabric address IDs), but which allows non-address-routed I/O transactions to drain from the I/O subsystem and complete before all remaining traffic is blocked using the traffic flow control block mechanism. The use of this special drain sequence mechanism ensures that only address-routed traffic is present in the I/O subsystem when traffic is blocked using the address pause mechanism. This process sequence is highly advantageous because, inter alia, it prevents non-address-routed traffic from being present in the blocked, back-pressured I/O fabric queues, as such traffic cannot be successfully routed after a fabric route change. Thus, the exemplary process sequence allows address-routed traffic to be paused and re-routed after unpausing, and further allows non-address-routed traffic (such as read completions and configuration read completions) to be drained from the system without knowledge by the VM (or other host software) prior to the I/O traffic pause. As such, both host and device are unaware of the traffic pause and traffic rerouting (other than via the passage of time). Of further benefit is the fact that in the exemplary embodiments, no special operations in the VM need take place, other than the stopping of VM execution (i.e., of VM guest OS or VM application software execution).
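The following Python sketch illustrates (under simplified, hypothetical assumptions, and with invented object and function names) the ordering property of the drain sequence described above: downstream traffic is paused first, outstanding completions are allowed to drain, and only then is upstream traffic blocked, so that no non-address-routed traffic remains queued behind the pause points.

```python
from collections import deque

class PausePoint:
    """Hypothetical stand-in for a fabric traffic block/pause point."""
    def __init__(self):
        self.paused = False
        self.held = deque()  # traffic stalled behind the point while paused

def drain_then_block(downstream_pp: PausePoint, upstream_pp: PausePoint,
                     drain_outstanding_completions) -> None:
    # 1. Stop new host->device requests at the downstream pause point.
    downstream_pp.paused = True
    # 2. Let requests already past that point reach the device, and let their
    #    non-address-routed (ID-routed) completions return upstream, so no
    #    completion is trapped behind the coming route change.
    drain_outstanding_completions()
    # 3. Only now block upstream traffic: anything still queued behind the
    #    pause points is address-routed and survives a later re-route.
    upstream_pp.paused = True

# Demo: the drain callback here is a no-op standing in for the real wait.
drain_then_block(PausePoint(), PausePoint(), lambda: None)
```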
In another embodiment, such methods and apparatus implement a specific process sequence that allows, inter alia, I/O traffic to be blocked where the originating and destination hosts have the same fabric IDs. In this embodiment, the draining of non-address-routed traffic during the migration route switch-over process is not necessary, and only the flow control blocking mechanism is used, without regard for draining non-address-routed traffic before blocking.
In another aspect of the disclosure, a system is described. In one embodiment, the system comprises one or more data fabrics (e.g., a PCIe-compliant switch fabric), a source host (e.g., a server system), and a destination host. The source and destination hosts each have one or more virtualized domains or endpoints associated therewith. Within the fabric/switch, one or more pause points for traffic are defined, and a protocol is implemented whereby all types of traffic are blocked from arrival at the real physical function, but where such operations are completed within the switch fabric by the VDI emulation implementation instead of the real fabric device. The VDI emulation-supported types of operations can proceed (at their respective hosts), such as during a migration of a VM from one of the hosts to the other. In another embodiment, a pausing/blocking mechanism which allows at least two of the virtualized endpoints (and at least portions of their respective hosts) to simultaneously communicate with the real endpoint allows fewer traffic types to be completed solely by the destination VDI emulation (as they can be conveyed to the real endpoint for completion) so as to, inter alia, reduce latency.
In another aspect, methods and apparatus for providing indication of a completed transaction are disclosed. In one embodiment, an entity (e.g., VDI) is used to appear to one or more other entities or processes to complete transactions. In one implementation, two use cases exist: (i) no simultaneous routing allowed, and (ii) simultaneous routing allowed. In the no-simultaneous-routing case, the destination VDI is used as a complete or comprehensive emulator, because it cannot “talk” to the real endpoint until the source is finished with it (because only one route is allowed). In simultaneous-routing-capable fabrics, two or more routes can coexist simultaneously, and in that case the destination VDI may elect to pass through some transactions to the real endpoint device (and have the endpoint complete them) rather than emulating them fully in the VDI and completing them there without the real endpoint ever having been aware of them.
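A minimal Python sketch of the two completion strategies follows; the transaction representation and all names are illustrative assumptions rather than any actual implementation.

```python
class DestinationVDI:
    """Illustrative destination VDI with two completion strategies."""

    def __init__(self, simultaneous_routing: bool, real_endpoint=None):
        self.simultaneous_routing = simultaneous_routing
        self.real_endpoint = real_endpoint  # reachable only if a second route may exist
        self.shadow_regs = {}               # local emulation state

    def handle(self, txn: dict):
        """Complete a host transaction by pass-through or by emulation."""
        if self.simultaneous_routing and self.real_endpoint is not None:
            # Case (ii): a second route coexists with the source route, so
            # selected transactions can be forwarded to the real endpoint.
            return self.real_endpoint.complete(txn)
        # Case (i): no simultaneous route yet exists, so the VDI must fully
        # emulate the transaction; the real endpoint never sees it.
        if txn["op"] == "cfg_read":
            return self.shadow_regs.get(txn["offset"], 0)
        self.shadow_regs[txn["offset"]] = txn["value"]  # cfg_write
        return None

# Demo of full emulation (no simultaneous routing available):
vdi = DestinationVDI(simultaneous_routing=False)
vdi.handle({"op": "cfg_write", "offset": 0x04, "value": 0x6})
assert vdi.handle({"op": "cfg_read", "offset": 0x04}) == 0x6
```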
In another aspect, a computer readable apparatus is disclosed. In one embodiment, the computer readable apparatus comprises a storage device (e.g., SSD, HDD, memory) associated with one or more hosts of the foregoing system.
In a further aspect, a fabric is disclosed. In one embodiment, the fabric comprises a PCIe-compliant switch or switching fabric configured for use with one or more pause points and two or more virtualized endpoints (e.g., VDIs) and so as to implement the pausing/blocking mechanism described herein, as well as reduction of setup and other latencies associated with e.g., VM migration.
In a further aspect, a method of migrating virtualized entities (e.g., VMs) is disclosed.
In yet another aspect, a host device is disclosed. In one embodiment, the host device comprises a computerized apparatus (e.g., PC, cluster node, server, or server blade) having one or more VMs associated therewith, and capable of data communication with a data fabric (e.g., using PCIe-based protocols) as well as one or more other hosts.
Other features and advantages of the present disclosure will immediately be recognized by persons of ordinary skill in the art with reference to the attached drawing(s) and detailed description of exemplary embodiments as given below.
All Figures disclosed herein are © Copyright 2022 GigaIO Networks. All rights reserved.
Reference is now made to the drawings wherein like numerals refer to like parts throughout.
As used herein, the term “application” (or “app”) refers generally and without limitation to a unit of executable software that implements a certain functionality or theme. The themes of applications vary broadly across any number of disciplines and functions (such as on-demand content management, e-commerce transactions, brokerage transactions, home entertainment, calculator, etc.), and one application may have more than one theme. The unit of executable software generally runs in a predetermined environment; for example, the unit could include a downloadable Java Xlet™ that runs within the JavaTV™ environment.
As used herein, the term “computer program” or “software” is meant to include any sequence of human or machine cognizable steps which perform a function. Such program may be rendered in virtually any programming language or environment including, for example, C/C++, Fortran, COBOL, PASCAL, Python, Ruby, assembly language, markup languages (e.g., HTML, SGML, XML, VoXML), and the like, as well as object-oriented environments such as the Common Object Request Broker Architecture (CORBA), Java™ (including J2ME, Java Beans, etc.) and the like, and may comprise one or more applications.
As used herein, the terms “device” or “host device” include, but are not limited to, servers or server farms, set-top boxes (e.g., DSTBs), gateways, modems, personal computers (PCs), and minicomputers, whether desktop, laptop, or otherwise, as well as mobile devices such as handheld computers, PDAs, personal media devices (PMDs), tablets, “phablets”, smartphones, vehicle infotainment systems or portions thereof, distributed computing systems or clusters, VR and AR systems, gaming systems, or any other computerized device.
As used herein, the terms “Internet” and “internet” are used interchangeably to refer to inter-networks including, without limitation, the Internet. Other common examples include but are not limited to: a network of external servers, “cloud” entities (such as memory or storage not local to a device, storage generally accessible at any time via a network connection, and the like), service nodes, access points, controller devices, client devices, etc.
As used herein, the term “memory” includes any type of integrated circuit or other storage device adapted for storing digital data including, without limitation, ROM, PROM, EEPROM, DRAM, SDRAM, DDR/2/3/4/5/6 SDRAM, EDO/FPMS, RLDRAM, SRAM, “flash” memory (e.g., NAND/NOR), 3D memory, and PSRAM.
As used herein, the terms “microprocessor” and “processor” or “digital processor” are meant generally to include all types of digital processing devices including, without limitation, digital signal processors (DSPs), reduced instruction set computers (RISC), general-purpose (CISC) processors, GPUs (graphics processing units), microprocessors, gate arrays (e.g., FPGAs), PLDs, reconfigurable computer fabrics (RCFs), array processors, secure microprocessors, and application-specific integrated circuits (ASICs). Such digital processors may be contained on a single unitary IC die, or distributed across multiple components.
As used herein, the term “network” refers without limitation to any wireline, wireless, optical, or other medium capable of transmitting data between two or more devices, entities or processes.
As used herein, the term “network interface” refers to any signal or data interface with a component or network including, without limitation, those of the PCI, PCIe, FireWire (e.g., FW400, FW800, etc.), USB (e.g., USB 2.0, 3.0, OTG), and Ethernet (e.g., 10/100/1000 (Gigabit Ethernet), 10-Gig-E, etc.) families.
As used herein, the term “PCIe” or “Peripheral Component Interconnect Express” refers without limitation to the technology described in PCI-Express Base Specification, Version 1.0a (2003), Version 1.1 (Mar. 8, 2005), Version 2.0 (Dec. 20, 2006), Version 2.1 (Mar. 4, 2009), Version 3.0 (Oct. 23, 2014), Version 3.1 (Dec. 7, 2015), Version 4.0 (Oct. 5, 2017), and Version 5.0 (Jun. 5, 2018, and May 2019), each of the foregoing incorporated herein by reference in its entirety, and any subsequent versions thereof.
As used herein, the term “server” refers without limitation to any computerized component, system or entity regardless of form which is adapted to provide data, files, applications, content, or other services to one or more other devices or entities on a computer network.
As used herein, the term “storage” refers without limitation to computer hard drives, DVR devices, memory, RAID devices or arrays, SSDs, optical media (e.g., CD-ROMs, Laserdiscs, DVD, Blu-Ray, etc.), or any other devices or media capable of storing content or other information.
The present disclosure describes methods and apparatus for, among other things, migrating one or more processes within a data processing system.
In one aspect, a system allows and makes use of two or more simultaneous VDIs (e.g., virtual configuration space(s)) being exposed, e.g., one to the originator (e.g., source host), and a second one to the destination (e.g., destination host). That is, in exemplary implementations described herein, the originating and destination Virtual Device Instantiations exist for at least a period of time simultaneously.
By creating a configuration-space-only DEST VDI instance, the DEST OS can then perform enumeration, discovery, and even driver load, which saves significant amounts of time (from microseconds to multiple seconds), thereby shortening overall migration times.
Partial operation of a DEST VDI entails supporting simulated configuration read and write protocol cycles, and generally will “end” by causing DEST VDI operation to block (i.e., wait on termination of the SOURCE VDI) at some point within a range of possible responses: at the earliest point, upon the configuration space write of the device's control bits (in the PCIe bus instantiation, the MEMORY SPACE enable bit of the COMMAND register); and at the latest point, upon accesses to device-specific (DEST VDI implementation-specific, based on device class/vendor/model) memory address offsets in the memory address space supported by the DEST VDI.
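The earliest such blocking point can be made concrete with a short, purely illustrative Python sketch: a configuration write that sets the Memory Space Enable bit (bit 1 of the COMMAND register at offset 0x04 in the standard PCI header) causes the partially operational DEST VDI to wait on SOURCE VDI termination. The event-based wait shown is a hypothetical stand-in for whatever synchronization mechanism an actual implementation would use.

```python
import threading

CMD_OFFSET = 0x04          # standard PCI COMMAND register offset
MEM_SPACE_ENABLE = 1 << 1  # Memory Space Enable bit within that register

class PartialDestVDI:
    """Config-space-only DEST VDI sketch: blocks on the earliest trigger."""

    def __init__(self):
        self.regs = {CMD_OFFSET: 0}
        self.source_terminated = threading.Event()  # set when the SOURCE VDI ends

    def cfg_write(self, offset: int, value: int) -> None:
        if offset == CMD_OFFSET and (value & MEM_SPACE_ENABLE):
            # Earliest blocking point: hold the write until the SOURCE VDI
            # has terminated and the DEST VDI may become fully operational.
            self.source_terminated.wait()
        self.regs[offset] = value

vdi = PartialDestVDI()
vdi.cfg_write(CMD_OFFSET, 0)                  # ordinary config write: completes at once
vdi.source_terminated.set()                   # migration logic signals SOURCE VDI teardown
vdi.cfg_write(CMD_OFFSET, MEM_SPACE_ENABLE)   # now permitted to proceed
```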
Exemplary embodiments of the present disclosure are now described in detail. While these embodiments are primarily discussed in the context of a PCIe-based component or system and related methods, such as those exemplary embodiments described herein and/or those compliant with the PCIe Base Specification e.g., Revision 3.0, 4.0, 5.0 or later, each incorporated herein by reference in its entirety, and/or those set forth in U.S. patent application Ser. No. 16/566,829 filed Sep. 10, 2019 and entitled “Methods and Apparatus for High-Speed Data Bus Connection and Fabric Management,” Ser. No. 17/079,288 filed Oct. 23, 2020 and entitled “Methods and Apparatus for DMA Engine Descriptors for High Speed Data Systems,” Ser. No. 17/061,366 filed Oct. 1, 2020 and entitled “Method and Apparatus for Fabric Interface Polling,” Ser. No. 17/016,228 filed Sep. 9, 2020 and entitled “Methods and Apparatus for Network Interface Fabric Send/Receive Operations,” and Ser. No. 17/016,269 filed Sep. 9, 2020 and entitled “Methods and Apparatus for Improved Polling Efficiency in Network Interface Fabrics,” each of the foregoing previously incorporated herein, as well as e.g., those set forth in U.S. Pat. Nos. 9,448,957, 9,152,597, 8,868,777, and 8,463,934, each entitled “Unified system area network and switch” and incorporated herein by reference in its entirety, the various aspects of the present disclosure are in no way so limited, and in fact may be used in any number of other applications and/or system architectures or topologies (whether PCIe-based or otherwise), the foregoing being merely exemplary. In fact, the various aspects of the disclosure are useful with, inter alia, other types of fabrics, bus architectures, and protocols.
In the context of the exemplary embodiments, it is useful to further define certain terms or concepts for purposes of illustration of the various aspects of the present disclosure. It will be appreciated that, as with the terms set forth supra, the various aspects of the disclosure are in no way limited by such definitions or terms, unless specifically stated herein.
Referring now to FIG. 1, one exemplary embodiment of a system 100 configured in accordance with the present disclosure is shown and described.
As illustrated in FIG. 1, the system 100 includes one or more source routes 108; in one implementation, these comprise route(s) within the I/O fabric which the I/O transactions follow (are routed upon) between the source host and the device. The source route 108 is in one embodiment defined by switch-specific rules that describe to where and in what manner I/O transactions are routed between the one particular PCIe tree (fabric) bus, device, function that describes the real endpoint device and the one particular PCIe tree (fabric) bus, device, function that describes the source host.
In the system 100, one or more destination routes 116 are also provided; in one implementation, these comprise route(s) within the I/O fabric which the I/O transactions follow (are routed upon) between the destination host and the device. The destination route 116 is in one embodiment defined by switch-specific rules that describe to where and in what manner I/O transactions are routed between the one particular PCIe tree (fabric) bus, device, function that describes the real endpoint device and the one particular PCIe tree (fabric) bus, device, function that describes the destination host. As with the source route 108 referenced above, the destination route 116 rules are logical, but generally have a physical hardware routing analog within the switch or switch fabric.
The foregoing routes (108 and 116, respectively) can be thought of in another way; i.e., as a source-to-destination route change, with source-to-destination VDI instantiations, and typically with a migration of a VM (although a migration of just hosts, without VM migration, can also be performed). As described in greater detail elsewhere herein, these routes (108 and 116, respectively) support at least the functionalities described subsequently herein.
In the illustrated embodiment of FIG. 1, the foregoing routes are implemented within an I/O switch fabric (e.g., a PCIe-compliant switch or switch fabric).
As shown in FIG. 1, the system 100 also includes a source host system 120 (e.g., Source Host); in one embodiment, this comprises a physical host computer system from which ownership of the endpoint device is migrated.
One or more source virtual machine instances 122 (e.g., Source VM, VM—Source) are also utilized; in one embodiment, these comprise a software instantiation and simulation or virtualization of a physical system that is comprised of software elements. Such a “computer system” runs on simulated hardware in most cases and not directly on actual physical hardware (as does a “host system”). Some VMs, however, are given direct access and control over specific real hardware elements, such as specific PCIe devices. In some implementations, aspects of such direct control over actual hardware by a virtual machine (VM), and by its hosting hypervisor software running on the base host system, are provided by the mechanisms described herein.
A source PCIe transparent pathway 124 is also illustrated on FIG. 1.
A destination host system 126 (e.g., Destination Host) is also illustrated on FIG. 1.
One or more destination Virtual Machine instances 128 (e.g., Destination VM, Destination Virtual Machine, VM—DEST) are present in the system 100 of FIG. 1.
The exemplary system 100 of FIG. 1 further includes a physical endpoint device function 132; i.e., the real endpoint device function whose ownership is to be migrated between hosts.
Also illustrated in FIG. 1 are one or more source host downstream traffic device pause points 134 (e.g., Source Downstream {Block Point, Pause Point}). In one embodiment, this pause point 134 is a point within the switch or switch fabric wherein I/O traffic traveling in the downstream direction (from source host to device) can be blocked/paused, such as using a switch or switch fabric traffic flow pause mechanism.
Similarly, one or more upstream pause points 136 (e.g., Upstream Traffic Device Pause Point {Block Point, Pause Point}) may be utilized. In the exemplary embodiment, this upstream pause point 136 is a point within the switch or switch fabric wherein I/O traffic traveling in the upstream direction (from device to host) can be blocked/paused, such as using a switch or switch fabric traffic flow pause mechanism. When traffic is paused in such a manner, the switch will continue to accept upstream I/O traffic if it has remaining flow control credits that allow for transaction ingress in the upstream direction, or if another handling mechanism exists. Upstream traffic will not be further routed in the switch or switch fabric past this point when traffic is paused. When traffic is resumed (unpaused, unblocked), these upstream-destined I/O transactions will be routed according to the routing rules in place when routing of traffic is resumed.
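Assuming a simplified credit-based flow control model (the names and structure below are illustrative only, not any actual switch interface), the behavior of such a pause point might be sketched as follows: ingress is accepted while credits remain, nothing is routed past the point while paused, and the sender observes only an ordinary flow-control stall.

```python
from collections import deque

class CreditedPausePoint:
    """Toy pause point under credit-based flow control."""

    def __init__(self, credits: int):
        self.credits = credits
        self.paused = False
        self.held = deque()

    def ingress(self, txn) -> bool:
        """Accept traffic while credits last; False means the sender stalls."""
        if self.credits == 0:
            return False          # observed by the device as a flow-control stop
        self.credits -= 1
        self.held.append(txn)
        return True

    def route(self, deliver) -> None:
        """Forward held traffic only when unpaused, returning credits."""
        while not self.paused and self.held:
            deliver(self.held.popleft())
            self.credits += 1

pp = CreditedPausePoint(credits=2)
pp.paused = True
assert pp.ingress("MemWr-1") and pp.ingress("MemWr-2")  # accepted while credits last
assert not pp.ingress("MemWr-3")  # sender now sees only ordinary backpressure
```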
It is noted that the downstream and upstream pause points described above may or may not, depending on the configuration, comprise the same point(s).
Yet further, a destination host downstream traffic device pause point 138 (e.g., Destination Downstream {Block Point, Pause Point}) is included within the system 100. In one embodiment, this destination host downstream traffic device pause point 138 is a point within the switch or switch fabric wherein I/O traffic traveling in the downstream direction (from destination host to device; contrast source host described above) can be blocked using e.g., a switch or switch fabric traffic flow pause mechanism. Similar to the other pause points above (134, 136), the destination host downstream pause point 138 may be configured such that when traffic is blocked/paused, the switch or switch fabric will continue to accept downstream flow traffic destined for the device if it has remaining flow credits that allow for transaction ingress (or another available handling mechanism). Downstream traffic will not be further routed in the switch or switch fabric past this point when traffic is paused. When traffic is resumed (unpaused, unblocked) then these Downstream destined I/O transactions will be routed according to the routing rules in place when routing of traffic is resumed.
Referring now to FIG. 1, various types of I/O traffic utilized within the system 100 are now described.
Various operations of the system 100 may also utilize so-called “address routed traffic” 140. For example, in one embodiment, I/O transactions that follow a route (such as 108, 116, 124, 130) make use of a steering mechanism that determines where the transaction unit (packet) will be sent at each step by examining the Address field in the I/O transaction packet; i.e., a field that holds the memory or I/O address with which the transaction is associated.
Conversely, non-address routed traffic 142 may also be utilized within the system 100. Specifically, in one implementation, I/O transactions that follow a route (such as 108, 116, 124, 130) make use of a steering mechanism that determines where the transaction unit (packet) will be sent at each step not by using the Address field of the transaction unit (packet), but rather by some other means. Examples of non-address routed traffic include, without limitation, READ Completions or Configuration Space READ Completions in the PCIe protocol, which are routed using the Routing Identifier (an ID identifying the original sender of the READ or Configuration Space READ request transaction) in order to determine the routing at each route step in the return path to the original sender. These packets use information other than the Address to determine their routing.
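A toy routing-decision sketch in Python may help illustrate the distinction; the maps and port names are invented for illustration only. Note that because ID-routed packets ignore the address map entirely, an address-based route change cannot meaningfully re-route them, which is why the drain sequence described elsewhere herein removes them before traffic is blocked.

```python
# Invented maps/port names; PCIe-like semantics are approximated only.
ADDRESS_MAP = [(0x1000_0000, 0x2000_0000, "port_to_device")]  # (base, limit, egress)
ID_MAP = {0x0100: "port_to_source_host"}                      # routing ID -> egress

def route(pkt: dict) -> str:
    if "address" in pkt:  # address-routed (e.g., memory read/write requests)
        for base, limit, egress in ADDRESS_MAP:
            if base <= pkt["address"] < limit:
                return egress
        raise LookupError("unroutable address")
    # non-address-routed (e.g., READ Completions): steered by Routing Identifier
    return ID_MAP[pkt["routing_id"]]

assert route({"address": 0x1800_0000}) == "port_to_device"
assert route({"routing_id": 0x0100}) == "port_to_source_host"
```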
As discussed previously herein, the presentation of virtualized device interfaces, e.g., the virtualization of a PCI-family device's configuration space (PCI, PCI-X, PCIe, CardBus, CXL, etc.), is known. In the context of the present disclosure, the term VDI (Virtual Device Interface) is used to describe the virtualized configuration space.
Despite recent advances in switch technology, current designs do not allow for the migration of hardware from one host system to another without a reboot of the host systems, aka “cold migration.”
Hot migration of a VM is the migration of a VM (or workload on a host) from one system to another without a perceptible loss in user function (typically for 1 second or less). Hot migration of a VM with direct hardware access requires the transfer of ownership and access of the direct hardware from a source host system (with source VM) to a destination host system with destination VM instantiation (i.e., a hot migration of both the VM and the associated direct-access hardware that the VM is using).
So-called Direct-Hardware-Access VM Hot Migration (DHA VM Hot Migration) thus involves the transfer of ownership from the source host (which has original direct access) to another destination host (which obtains final direct access, as the source host eventually loses access at the completion of the migration process). The two hosts may or may not feature hypervisor systems on them supporting VMs. For DHA VM Hot Migration, VMs and hypervisors are present on the hosts, and the source host 120 (FIG. 1) initially has direct access to the endpoint device on behalf of the source VM 122.
Advantageously, the various aspects of the present disclosure provide for, inter alia, such hot-migration of device ownership between e.g., a source system, to a destination system, including the exemplary specific case of Direct-Hardware-Access VM migration, which for the two hosts involved is just a subset of overall device ownership usage for each of the two host systems. Moreover, aspects of the disclosure provide for the sub-ownership (or sub-leased ownership) of a VM in a source system, and a VM in a destination system. The concepts introduced here apply to, inter alia, any protocol that utilizes an ID-based register address space (such as that found in PCI, PCIe, CXL, and many other protocols), and hence the exemplary PCIe-based variants described herein are purely illustrative of the broader concepts.
It is noted that unlike some extant approaches, the exemplary configurations of the present disclosure are not restricted to originating and destination Virtual Device Instantiations existing only in mutual exclusion (i.e., never at the same time). This presents a significant improvement, overcoming the limitations that (i) only one host can see and own the VDI, and (ii) there can only be one VDI in existence for a real physical device at one time. In contrast, the present disclosure allows and makes use of two or more simultaneous VDIs (e.g., virtual configuration space(s)) being exposed, e.g., one to the originator (e.g., source host), and a second one to the destination (e.g., destination host). In exemplary implementations described herein, the originating and destination Virtual Device Instantiations exist for at least a period of time simultaneously. Specifically, in one variant, the destination Virtual Device Instantiation comes into being (is detectable or accessible) while at the same time, the previously created originating Virtual Device Instantiation remains in existence (advantageously allowing the originating Physical Host to perform at least some detection and access operations during that time period, while also allowing the destination Physical Host(s) to use the Destination Virtual Device Instantiation to perform at least some detection and access operations). Extant solutions preclude such simultaneous operation of originating and destination Virtual Machine Device Instantiations.
It will also be appreciated that in some embodiments, the destination VDI may only be partial (e.g., allow only some operations). Another benefit of the capability of the present disclosure whereby both the source and destination VDI interfaces exist simultaneously is the substantial reduction in the time that I/O traffic must be halted during migration. The destination (e.g., destination Physical Host) may proceed with migration setup operations of that Physical Host's resident operating system while, at the same time, the originator (e.g., originating VDI) is still operating, allowing the originating VDI to service the originating VM and originating Physical Host without I/O traffic yet being paused.
Destination host operations include (but are not limited to) processing device hot-add events (e.g., those of the Destination VDI), detecting and enumerating the destination VDI instance during normal operating system processing, and allowing host operating system operations that may configure integrated or hosted virtual machine operations to a destination VM instance (e.g., the copy of the VM on the destination host, on the destination base OS, running on the destination Physical Host) for the purpose of “plumbing” and routing setup in the host OS; e.g., establishing memory translations, creating appropriate IOMMU (Input Output Memory Management Unit) entries in the destination host, or performing any other destination host software operations that require the detection and access of the destination VDI.
Originating host operations include, for example, all I/O operations, as the originating host typically continues to operate for a relatively longer period of time; e.g., until the final paused I/O sequence used to change ongoing I/O operations from the source host to the destination host (the route change). Allowing the originating VDI to operate while the destination VDI is in existence allows, inter alia, the originating VM to continue to operate while the destination host is creating the plumbing (such as destination VDI discovery, enumeration, driver loading, driver quiescing, hypervisor handoff, and IOMMU entry creation (which can be dependent on destination VDI identity discovery through enumeration)). This allows the time period when the originating VM must finally be frozen (and can no longer operate), but the destination VM is not yet ready to operate, to be greatly shortened, because the substantial initialization, setup, IOMMU entry, driver loading, and driver pausing activities may be either completed, or at least substantially completed, before having to pause the originating VM in preparation for handoff migration to the copied and synchronized destination VM. The use of a destination VDI that exists at the same time as the originating VDI allows the destination host to complete substantial migration transition initialization activities while the originating host continues to operate, and thus diminishes the time during the migration when the VM must be offline and not performing active work or responding to external events. Minimizing the VM downtime in such migrations is thus one highly valuable aspect of the present disclosure.
As shown in FIG. 2, the method 200 begins at step 202, wherein an originator-side function, process or entity (e.g., a source VDI) is created or instantiated.
Next, per step 204, a destination-side process is created or instantiated while the originator-side process of step 202 remains in existence (i.e., is accessible or can be detected/seen by e.g., a corresponding host).
Per step 206, the originator-side process and the destination-side process are maintained in co-existence for a period of time, during which one or more operations are performed. As previously described herein, such operation(s) may include e.g., operations conducted by the destination-side process in preparation for VM migration, thereby shortening or reducing the time or gap between when the originating-side process (e.g., VDI) must be paused and when the destination-side VDI is fully ready to operate.
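The method 200 may be summarized in a brief, purely illustrative Python sketch; every object and method below (fabric, create_vdi, enumerate, load_driver, etc.) is a hypothetical placeholder for the corresponding fabric-management and host-OS operation, and only the step ordering reflects the foregoing description.

```python
def migrate_vdi(fabric, source_vdi, dest_host):
    # Step 202 analog: the originator-side VDI already exists and is live.
    assert source_vdi.active

    # Step 204 analog: instantiate the destination VDI while the source VDI
    # remains detectable/accessible to its own host.
    dest_vdi = fabric.create_vdi(dest_host)

    # Step 206 analog: both VDIs co-exist; the destination host performs its
    # setup (enumeration, driver load, IOMMU entries) while the source VM
    # continues I/O, shrinking the eventual pause window.
    dest_host.enumerate(dest_vdi)
    dest_host.load_driver(dest_vdi)
    return dest_vdi
```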
Among other things, the present disclosure describes pause mechanisms useful in e.g., the VM migration process.
In one implementation of such mechanisms, physical endpoint device functions are maintained unaware of traffic re-routing within the I/O fabric due to e.g., downstream (host-to-device) I/O traffic having been stopped in a particular sequence, making the host or host virtual machine (the I/O traffic source and sink) unaware of the operation. This approach has the effect that the I/O device believes that the pause is ascribed to another cause; e.g., for flow control reasons only. This allows traffic to be paused for a period of time, so as to carry out an I/O traffic routing change. Such traffic routing changes are necessary in VM migration, and useful in other situations such as where a VM needs to be paused, or other host operations need to be paused. In this description, I/O traffic from the host to the device is referred to as downstream traffic, while I/O traffic from the device to the host server system is referred to as upstream traffic, although these terms are merely for purposes of illustration and in no way limiting on the broader concepts described herein.
At a high level, the exemplary methodology described herein seeks to first pause downstream traffic within the fabric at e.g., a blocking point (the downstream blocking point), while allowing downstream commands that have already passed this blocking point to drain downstream to the physical endpoint device function 132. In the case of commands that require immediate (or near immediate) responses, the physical endpoint device function 132 is able to complete (and thus drain) such commands from the outstanding set from the host, such that only I/O traffic commands requiring completions that have not passed the downstream blocking point remain outstanding. This approach allows all completion-type traffic (for instance, in the exemplary PCIe I/O protocol, memory read completions and configuration cycle read completions) for I/O traffic commands that have passed the downstream blocking point to complete in the upstream direction before upstream traffic is blocked (at the upstream blocking point) within the I/O fabric.
Additionally, the physical endpoint device function 132 is allowed to keep sending non-completion-based upstream traffic (from device to host) at its leisure until such time as protocol flow control prevents the device from transmitting upstream (only because flow control conditions prevent it from doing so, as described above), and all return device-to-host completion traffic for the corresponding downstream I/O command flows is drained. Eventually, such traffic will not include completion traffic, because such completion traffic was allowed to complete before blocking the upstream traffic at the fabric upstream blocking point for this physical endpoint device function. This mechanism advantageously allows, inter alia, the physical endpoint device function 132 to be unaware of any pausing, quiescing, or identity change that occurs with respect to its communications.
Referring now to FIG. 3, one embodiment of a method 300 of pausing and re-routing I/O traffic (e.g., in support of a VM migration) is shown and described. Per step 302 of the method 300, execution of the relevant host/VM software (i.e., the I/O traffic source and sink software) is first stopped.
It is also noted in passing that in the exemplary embodiment, the destination host in such case normally does not have any traffic, because the destination host at this point is unaware of the real hardware, is unaware of the destination VDI (until that point in the sequence when the destination VDI is created), and the Destination VM is not yet aware of the destination VDI, nor the real endpoint.
Next, per step 304 of the method 300, all downstream I/O transactions are completed to the device (endpoint function). In one implementation, downstream transactions that require a completion are completed by the I/O device and sent to the still active upstream link. At this point, both directions of I/O traffic are still active; only the software execution has been stopped. Protocol completion traffic (such as READ responses, and configuration READ responses) completes nearly immediately, and a very short wait in sequence control software (separate from the I/O source/sink software that has been stopped) facilitates the guarantee that all upstream completion traffic in response to downstream I/O command traffic completes.
At the completion of step 304, all downstream traffic and responsive upstream completions are now complete.
Per step 306, upstream traffic is then paused in the traffic network (not in the device itself). Any means of network pause may be used consistent with the present disclosure. For instance, in one implementation, a PCIe switch's ability to pause traffic on a particular link/channel is used as the basis of pausing the traffic. This type of traffic pause or blocking is observed at the I/O device as I/O protocol flow control blockage. The device will stop only because it is blocked by the flow control mechanism of the I/O link. It is noted that in some PCIe-based implementations, because the PCIe link credits are exhausted at the device, the device will stop until more credits are available using the flow control mechanism.
Per step 308, any upstream device-originated bus mastered DMA read and write operations that have not passed the host downstream traffic block pause point (134 or 138, described supra) are paused at the pause point, such that no traffic passes the pause point. If flow control mechanisms allow traffic to be delivered prior to the pause point, it continues to be delivered until it reaches the pause point, where it stalls because of the pause condition. In one embodiment, using a PCIe fabric, the upstream traffic will continue to flow to the device upstream location pause point 136 (FIG. 1), where it is held until traffic is resumed.
Per step 310, any upstream device-originated bus mastered DMA write operations that have passed the block point continue to the host and are completed.
Per step 312, any upstream device-originated bus mastered DMA read (contrast: write) operations that have passed the block point continue to the host, and result in host to device completions in the downstream direction. These cycles are not yet blocked, and so they complete from the host to the device, and the device sees such cycles as satisfied and completed.
It is noted that in the exemplary embodiment described herein (i.e., in the context of a PCIe-based system), for upstream device-originated cycles behind the block point, completion timers will still apply and run in the device, as do acknowledgment (ACK) timers.
Referring again to FIG. 3, the remaining operations of the method 300 are now described.
The memory image in the host is now safe to collect, as it is frozen since I/O traffic in both directions is now blocked. In the exemplary implementation, a few microseconds of delay are needed to allow the pre-block traffic to complete in the process shown in earlier steps of the method 300. For a VM migration application, the VM image is now collected at this point in time per step 314, or any other point before traffic is restarted.
Per step 316, one or more I/O routes can now be changed between the two blockage points as required for a migration ownership route change. For instance, traffic will be changed from the Source Route (108 of FIG. 1) to the Destination Route (116 of FIG. 1).
The destination host completes its preparations for receiving inbound traffic at this point. For instance, in the exemplary application where the host software migration is a VM migration, the target (destination) VM completes its IOMMU setup, physical page mapping, and most importantly, its IOVA setup in the IOMMU subsystem, such that the IOVAs valid on the original host mapping for the VM are still valid on the new destination mapping of the VM on the new destination host. It will be appreciated that earlier ongoing synchronization preparation(s) can have occurred at any prior step, but finalization of VM preparation must occur before I/O traffic is unblocked and resumed.
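The IOVA-preservation requirement can be illustrated with a minimal sketch (dicts standing in for IOMMU page-table state; all addresses invented): the device-visible IOVAs are kept identical across the migration, while the backing host-physical pages change.

```python
# All addresses invented; dicts stand in for IOMMU page-table state.
source_iommu = {0x4000: 0x9F000000}      # IOVA -> source-host physical page
dest_phys_pages = {0x4000: 0x32000000}   # same IOVA backed by a new dest page

def build_dest_iommu(source_map: dict, dest_pages: dict) -> dict:
    # Keep every IOVA identical so device-visible DMA addresses remain valid
    # after the route change; only the backing physical page differs.
    return {iova: dest_pages[iova] for iova in source_map}

dest_iommu = build_dest_iommu(source_iommu, dest_phys_pages)
assert set(dest_iommu) == set(source_iommu)  # IOVAs unchanged across migration
```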
Next, per step 318 of FIG. 3, I/O traffic is unblocked (unpaused) at the fabric pause points, and per step 320, execution of the migrated VM software is resumed on the destination host.
Per step 322, upstream and downstream traffic that was blocked in the fabric (behind the blocking points) as previously described now completes.
Similarly, new traffic from the device to the new destination host can transit and complete (step 324).
New downstream traffic from the destination (new) host to the device can now transit and complete (step 326).
Lastly, per step 328, activity in the system is fully resumed.
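For illustration only, the overall ordering of the method 300 can be condensed into the following Python sketch; each call is a hypothetical stand-in for the fabric, hypervisor, or IOMMU operation named in the corresponding step, and only the sequencing reflects the foregoing description.

```python
def hot_migrate(vm, fabric, src_route, dst_route, dest_host):
    vm.stop_execution()                        # 302: stop the I/O source/sink software
    fabric.pause_downstream()                  # block new downstream traffic in the fabric
    fabric.drain_downstream_completions()      # 304: in-flight cmds and completions finish
    fabric.pause_upstream()                    # 306: block upstream within the fabric
    fabric.drain_preblock_traffic()            # 308-312: pre-block traffic completes
    image = vm.collect_memory_image()          # 314: image is frozen and safe to copy
    fabric.change_route(src_route, dst_route)  # 316: re-route between the block points
    dest_host.finalize_iommu(image)            # IOVA-preserving IOMMU finalization
    fabric.unpause_all()                       # 318: unblock; queued traffic completes
    dest_host.resume_vm(image)                 # 320-328: new traffic flows; activity resumes
    return image
```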
It will be appreciated that while the foregoing methodology is described as a sequence of steps, two or more of the steps may be performed, whether in whole or part, in parallel with one another, as applicable. Moreover, the order of certain steps can be permuted consistent with achieving the goals and functionality described herein. Hence, the foregoing methodology is merely but one particular implementation of the broader principles and methodologies of the disclosure.
Moreover, it will be recognized that certain steps may be rendered unnecessary or optional under certain circumstances. Additionally, certain steps or functionality may be added to the disclosed embodiments. Furthermore, features from two or more of the methods may be combined. All such variations are considered to be encompassed within the disclosure disclosed and claimed herein.
The description set forth herein, in connection with the appended drawings, describes example configurations and does not represent all the examples that may be implemented or that are within the scope of any existing or later-added claims. The term “exemplary” used herein means “serving as an example, instance, or illustration,” and not “preferred” or “advantageous over other examples.”
The detailed description includes specific details for the purpose of providing an understanding of the described techniques. These techniques, however, may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form in order to avoid obscuring the concepts of the described examples.
Information and signals described herein may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
While the above detailed description has shown, described, and pointed out novel features of the disclosure as applied to various embodiments, it will be understood that various omissions, substitutions, and changes in the form and details of the device or process illustrated may be made by those skilled in the art without departing from the disclosure. This description is in no way meant to be limiting, but rather should be taken as illustrative of the general principles of the disclosure. The scope of the disclosure should be determined with reference to the claims.
It will be further appreciated that while certain steps and aspects of the various methods and apparatus described herein may be performed by a human being, the disclosed aspects and individual methods and apparatus are generally computerized/computer-implemented. Computerized apparatus and methods are necessary to fully implement these aspects for any number of reasons including, without limitation, commercial viability, practicality, and even feasibility (i.e., certain steps/processes simply cannot be performed by a human being in any viable fashion).
The functions described herein may be implemented in hardware, software executed by a processor, firmware, or any combination thereof. If implemented in software executed by a processor, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable apparatus (e.g., storage medium). Computer-readable media include both non-transitory computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A non-transitory storage medium may be any available medium that can be accessed by a general purpose or special purpose computer.
Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, cloud entity, cluster, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave are included in the definition of medium.
Disk and disc, as used herein, include CD, laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above are also included within the scope of computer-readable media.
This application claims priority to co-owned and co-pending U.S. Provisional Patent Application Ser. No. 63/351,770 of the same title, filed Jun. 13, 2022, which is incorporated herein by reference in its entirety. Additionally, this application is generally related to, and/or is useful with, one or more aspects of, subject matter contained in: co-owned U.S. patent application Ser. No. 16/566,829 filed Sep. 10, 2019, entitled “Methods and Apparatus for High-Speed Data Bus Connection and Fabric Management,” and issued as U.S. Pat. No. 11,593,291 on Feb. 28, 2023; U.S. patent application Ser. No. 17/079,288 filed Oct. 23, 2020, entitled “Methods and Apparatus for DMA Engine Descriptors for High Speed Data Systems,” and issued as U.S. Pat. No. 11,392,528 on Jul. 19, 2022; U.S. patent application Ser. No. 17/061,366 filed Oct. 1, 2020, entitled “Method and Apparatus for Fabric Interface Polling,” and issued as U.S. Pat. No. 11,593,288 on Feb. 28, 2023; U.S. patent application Ser. No. 17/016,228 filed Sep. 9, 2020, entitled “Methods and Apparatus for Network Interface Fabric Send/Receive Operations,” and issued as U.S. Pat. No. 11,403,247 on Aug. 2, 2022; and co-pending U.S. patent application Ser. No. 17/016,269 filed Sep. 9, 2020, entitled “Methods and Apparatus for Improved Polling Efficiency in Network Interface Fabrics,” each of the foregoing incorporated herein by reference in its entirety.