During the past decade, there has been tremendous growth in the usage of so-called “cloud-hosted” services. Examples of such services include e-mail services provided by Microsoft® (Hotmail/Outlook online), Google® (Gmail) and Yahoo® (Yahoo mail), productivity applications such as Microsoft® Office 365 and Google Docs, and Web service platforms such as Amazon® Web Services (AWS) and Elastic Compute Cloud (EC2) and Microsoft® Azure. Cloud-hosted services are typically implemented using data centers that have a very large number of compute resources, implemented in racks of various types of servers, such as blade servers filled with server blades and/or modules and other types of server configurations (e.g., 1U, 2U, and 4U servers).
In recent years, virtualization of computer systems has also seen rapid growth, particularly in server deployments and data centers. Under a conventional approach, a server runs a single instance of an operating system directly on physical hardware resources, such as the CPU, RAM, storage devices (e.g., hard disk), network controllers, input-output (IO) ports, etc. Under one virtualized approach using Virtual Machines (VMs), the physical hardware resources are employed to support corresponding instances of virtual resources, such that multiple VMs may run on the server's physical hardware resources, wherein each virtual machine includes its own CPU allocation, memory allocation, storage devices, network controllers, IO ports, etc. Multiple instances of the same or different operating systems then run on the multiple VMs. Moreover, through use of a virtual machine manager (VMM) or “hypervisor” or “orchestrator,” the virtual resources can be dynamically allocated while the server is running, enabling VM instances to be added, shut down, or repurposed without requiring the server to be shut down. This provides greater flexibility for server utilization, and better use of server processing resources, especially for multi-core processors and/or multi-processor servers.
Under another virtualization approach, container-based OS virtualization is used that employs virtualized “containers” without use of a VMM or hypervisor. Instead of hosting separate instances of operating systems on respective VMs, container-based OS virtualization shares a single OS kernel across multiple containers, with separate instances of system and software libraries for each container. As with VMs, there are also virtual resources allocated to each container.
Para-virtualization (PV) is a virtualization technique introduced by the Xen Project team and later adopted by other virtualization solutions. PV works differently than full virtualization—rather than emulate the platform hardware in a manner that requires no changes to the guest operating system (OS), PV requires modification of the guest OS to enable direct communication with the hypervisor or VMM. PV also does not require virtualization extensions from the host CPU and thus enables virtualization on hardware architectures that do not support hardware-assisted virtualization. PV IO devices (such as virtio, vmxnet3, netvsc) have become the de facto standard of virtual devices for VMs. Since PV IO devices are software-oriented devices, they are friendly to cloud criteria like live migration.
While PV IO devices are cloud-ready, their IO performance is poor relative to solutions supporting IO hardware pass-through virtual functions (VFs), such as single-root input/output virtualization (SR-IOV). However, pass-through methods such as SR-IOV have a few drawbacks. For example, when performing live migration, the hypervisor/VMM is not aware of device state that is passed through to the VM and is transparent to the hypervisor/VMM.
In some virtualized environments, a mix of different operating system types from different vendors is implemented. For example, Linux-based hypervisors and VMs (such as Kernel-based Virtual Machines (KVMs) and Xen hypervisors) dominate cloud-based services provided by AWS and Google®, and have recently been added to Microsoft® Azure. At the same time, Microsoft® Windows operating systems dominate the desktop application market. With the ever-increasing use of cloud-hosted Windows applications, the use of Microsoft® Windows guest operating systems hosted on Linux-based VMs and hypervisors has become more common. However, on KVM and Xen hypervisors, the ability to live migrate a virtual machine hosting a Windows guest OS that has an SR-IOV VF attached to it has yet to be supported.
The foregoing aspects and many of the attendant advantages of this invention will become more readily appreciated as the same becomes better understood by reference to the following detailed description, when taken in conjunction with the accompanying drawings, wherein like reference numerals refer to like parts throughout the various views unless otherwise specified:
Embodiments of methods, software, and apparatus for implementing live migration of virtual machines hosted by a Linux OS and running a Windows OS are described herein. In the following description, numerous specific details are set forth to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention can be practiced without one or more of the specific details, or with other methods, components, materials, etc. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of the invention.
Reference throughout this specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases “in one embodiment” or “in an embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
For clarity, individual components in the Figures herein may also be referred to by their labels in the Figures, rather than by a particular reference number. Additionally, reference numbers referring to a particular type of component (as opposed to a particular component) may be shown with a reference number followed by “(typ)” meaning “typical.” It will be understood that the configuration of these components will be typical of similar components that may exist but are not shown in the drawing Figures for simplicity and clarity or otherwise similar components that are not labeled with separate reference numbers. Conversely, “(typ)” is not to be construed as meaning the component, element, etc. is typically used for its disclosed function, implementation, purpose, etc.
In accordance with aspects of the embodiments disclosed herein, solutions are provided that support live migration of virtual machines hosted by a Linux OS and running instances of Windows OS that have an SR-IOV VF attached to them. The solutions are implemented, in part, with software components implementing standard Windows interfaces and drivers to support compatibility with existing and future versions of Windows operating systems. Communication between the VM and a network device is implemented using a virtual function (VF) datapath coupled to an SR-IOV VF on the network device and an emulated interface coupled to a physical function on the network device, wherein the emulated interface employs software components in a Linux Hypervisor, and the VF datapath is a pass-through VF that bypasses the Hypervisor. In preparation for live migration, the active datapath used by the source VM on the source host is switched from the VF datapath to the emulated datapath, enabling the Hypervisor on the source host to track dirtied memory pages during the live migration. After an initial post migration state following the live migration during which the emulated datapath is used by the destination VM on the destination host, the active datapath for the destination VM is returned to the VF datapath.
In one embodiment, architecture 100 is implemented using a Linux KVM (Kernel-based Virtual Machine) architecture, such as illustrated in Linux KVM architecture 200 of FIG. 2.
As illustrated in
Returning to
NIC 108 depicts components and interfaces that are representative of a NIC or similar network device, network adaptor, network interface, etc., that provides SR-IOV functionality, as is abstractly depicted by a PCIe (Peripheral Component Interconnect Express) endpoint with SR-IOV block 134. SR-IOV block 134 includes virtual functions 136 and 138, a physical function 140, and a virtual Ethernet bridge and classifier 142. NIC 108 further includes a pair of ports 144 and 146 (also referred to as Port 0 and Port 1) coupled to an internal switch with Tx (transmit) and Rx (receive) buffers 148 that is also coupled to PCIe endpoint with SR-IOV block 134. In the illustrated embodiment, NIC 108 is implemented as a PCIe expansion card installed in a PCIe expansion slot in the host platform, or as a PCIe component or daughterboard coupled to the main board of the host platform.
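By way of a non-limiting, illustrative sketch (not taken from the embodiments above), SR-IOV virtual functions such as virtual functions 136 and 138 are commonly instantiated on a Linux host by writing the desired VF count to the device's sriov_numvfs sysfs attribute; the helper name and PCI address below are hypothetical placeholders, and actual provisioning may be handled by other host tooling.

```c
/* Illustrative sketch only: instantiate SR-IOV VFs on a Linux host by
 * writing the desired count to the device's sriov_numvfs attribute.
 * The PCI address in the usage example is a hypothetical placeholder. */
#include <stdio.h>

int enable_vfs(const char *pci_addr, int num_vfs)
{
    char path[256];
    snprintf(path, sizeof(path),
             "/sys/bus/pci/devices/%s/sriov_numvfs", pci_addr);

    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%d\n", num_vfs);   /* e.g., two VFs for VMs 104 and 106 */
    return fclose(f);
}

/* Usage (hypothetical PCI address): enable_vfs("0000:03:00.0", 2); */
```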
Network communication between VMs 104 and 106 and NIC 108 may be implemented using two datapaths, one of which is active at a time. Under a first datapath referred to as the “emulated” datapath, the datapath is from Miniport interface 128 to protocol driver 130, NetKVM driver 120, VirtIO-Net 110, SW switch 112, and PF driver 114 to physical function 140. This emulated datapath is a virtualized software-based datapath that is implemented via the aforementioned software components in the Hypervisor and VMs. When the emulated datapath is used, NIC 108 is referred to as the NetKVM device.
Under a second datapath referred to as the VF datapath, communication between a VM 104 or 106 and NIC 108 employs direct memory access (DMA) data transfers over PCIe in a manner that bypasses the Hypervisor (or otherwise is referred to as passing through the Hypervisor or a pass-through datapath). This VF datapath includes the datapath from Miniport interface 128 to protocol driver 132 to VF Miniport driver 122, and from VF Miniport driver 122 to virtual function 136 or 138 (depending on which VM 104 or 106 the communication originates from). When the VF datapath is used, NIC 108 is referred to as the VF device.
It is further noted that each of the emulated and VF datapaths illustrated herein is bi-directional, and that the complete datapaths between the VMs and a network (not shown) to which NIC 108 is connected include datapaths internal to NIC 108. For example, an inbound packet (e.g., a packet received at port 144 or 146) will be buffered in an Rx buffer and subsequently copied to a queue (not shown) associated with one of virtual functions 136 and 138 or physical function 140, depending on the destination VM for the packet and whether the emulated datapath or VF datapath is currently active. In addition to the components shown, NIC 108 may include embedded logic or the like for implementing various operations, such as but not limited to packet and/or flow classification and generation of corresponding hardware descriptors. For outbound packets (e.g., packets originating from a VM and addressed to an external host or device accessed via a network), NIC 108 will include appropriate logic for forwarding packets from virtual functions 136 and 138 and physical function 140 to the appropriate outbound port (port 144 or 146). In addition, each of virtual functions 136 and 138 and physical function 140 will include one or more queues for buffering packets received from the VMs (not shown).
During platform run-time operations, it is preferred to employ the VF datapath when possible, as this datapath provides substantial performance improvements over the emulated datapath since it involves little or no CPU overhead (in other words, it may be implemented without consuming valuable CPU cycles otherwise used for executing software on the host). However, the VF datapath cannot be used during live migration since it provides no means to identify which memory pages are dirtied during the live migration, a function that is provided by the Hypervisor (e.g., as part of built-in functionality provided by Linux host/Hypervisor 102). Thus, a mechanism is needed to switch from the VF datapath to the emulated datapath during live migration in a manner that is transparent to the Windows OS running on the VMs.
Under the live migration solution in
With this solution, prior to starting the live migration the VF device is hot-unplugged and MUX IM driver 118 switches the datapath to the NetKVM device, which is used during the live migration. Hence, the live migration proceeds with no impact on traffic running in the VM. Once the migration is complete, the VF instance is hot-added to the migrated VM and again put into the team with the MUX IM driver, and traffic is then resumed using the VF datapath.
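The following is a minimal sketch of how a management layer on the Linux host side might drive this hot-unplug/migrate/hot-add sequence, here assuming (for illustration only) the libvirt C API; the domain name, connections, hostdev XML, and VF PCI address are hypothetical placeholders, and the embodiments are not limited to any particular management API.

```c
/* Illustrative sketch, assuming a libvirt-managed KVM host; the hostdev
 * XML and PCI address are hypothetical placeholders.                    */
#include <libvirt/libvirt.h>

static const char *vf_hostdev_xml =
    "<hostdev mode='subsystem' type='pci' managed='yes'>"
    "  <source>"
    "    <address domain='0x0000' bus='0x03' slot='0x10' function='0x1'/>"
    "  </source>"
    "</hostdev>";

int migrate_vm_with_vf(virConnectPtr src, virConnectPtr dst, const char *name)
{
    virDomainPtr ddom = NULL;
    virDomainPtr dom = virDomainLookupByName(src, name);
    if (!dom)
        return -1;

    /* 1. Hot-unplug the VF; the MUX IM (or NetKVM/MUX IM) driver in the
     *    Windows guest reacts by switching traffic to the emulated
     *    VirtIO-Net (NetKVM) datapath.                                   */
    if (virDomainDetachDeviceFlags(dom, vf_hostdev_xml,
                                   VIR_DOMAIN_AFFECT_LIVE) < 0)
        goto out;

    /* 2. Live migrate while only the emulated datapath is active, so the
     *    hypervisor on the source host can track dirtied memory pages.   */
    ddom = virDomainMigrate(dom, dst, VIR_MIGRATE_LIVE, NULL, NULL, 0);
    if (!ddom)
        goto out;

    /* 3. Hot-add a VF on the destination host; the guest driver places it
     *    back in the team and resumes traffic on the VF datapath.        */
    if (virDomainAttachDeviceFlags(ddom, vf_hostdev_xml,
                                   VIR_DOMAIN_AFFECT_LIVE) < 0)
        goto out;

    virDomainFree(ddom);
    virDomainFree(dom);
    return 0;

out:
    if (ddom)
        virDomainFree(ddom);
    virDomainFree(dom);
    return -1;
}
```

In such a sketch, the caller would obtain src and dst connections via virConnectOpen() against the source and destination hosts; the VF attached on the destination host is a different VF instance than the one detached at the source.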
In this solution, there is no need to change the existing Miniport drivers. The only modifications are the new MUX IM driver and associated installation packages.
In further detail, the differences between platform architectures 100 and 300 are implemented in the instances of Windows OS 116 for architecture 100 and instances of Windows OS 117 in platform architecture 300, which are running on VMs 304 and 306. Each instance of Windows OS 117 employs a NetKVM/MUX IM driver 308 including a Miniport interface 310 and a protocol driver 312. Miniport interface 310 is coupled to protocol driver 312 via an internal communication path 314 and is coupled (via NetKVM/MUX IM driver 308) to VirtIO-Net backend 110 via an internal communication path 316. Protocol driver 312 is connected to VF Miniport driver 122.
Under host platform architecture 300, the emulated datapath is from Miniport interface 310 in NetKVM/MUX IM driver 308 to VirtIO-Net 110, to SW switch 112 to PF driver 114 to physical function 140. The VF datapath is from Miniport interface 310 to protocol driver 312 to VF Miniport driver 122 to virtual function 136 or 138.
A fundamental concept for NetKVM/MUX IM driver 308 in the enslave solution is to extend the NetKVM Miniport driver to include the MUX IM functionality discussed above. This extended NetKVM driver (NetKVM/MUX IM driver 308) has the ability to “enslave” the VF Miniport driver underneath—that is, the VF Miniport driver is bound to the NetKVM driver. Like the MUX IM solution in
With this solution, the NetKVM/MUX IM driver will use the VF datapath during ongoing runtime operations for the VM. This is done by the protocol edge invoking the system protocol driver directly and bypassing the NetKVM Miniport driver. Then, to set up live migration, the VF Miniport driver is ejected and the NetKVM/MUX IM driver can invoke the emulated VirtIO-Net datapath. This has the advantage of both simplifying the MUX IM driver and speeding up switching between the datapaths.
For both solutions, during installation the MUX IM or NetKVM/MUX IM driver will insert itself between the Windows protocol driver and the Miniport drivers. This is accomplished using a utility called notify object that is part of the new driver package.
In the default environment, all traffic will be directed to the VF Miniport driver, as it is used by the VF datapath. When live migration starts, the VF device will be unplugged at some point, and the protocol edge of the MUX IM driver or NetKVM/MUX IM driver (in the enslave case) will be notified. This results in switching the traffic to use the emulated VirtIO-Net datapath. At a later stage of the live migration, the SR-IOV VF and VF datapath are enabled again on the destination host, so the protocol driver for the VF Miniport driver will be notified again. Traffic will then resume using the VF Miniport driver (and the VF datapath). The switching of the datapath is done internally by the MUX IM driver or NetKVM/MUX IM driver, and is transparent to the upper layer system protocol driver (and thus also transparent to the operating systems running on the VMs).
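A minimal sketch of this switching behavior is shown below using hypothetical C function and type names (it is not the actual NDIS driver code); mux_send_vf() and mux_send_netkvm() merely stand in for the driver's bindings to the VF Miniport driver and to the emulated VirtIO-Net (NetKVM) path, and the plug/unplug callbacks stand in for the notifications described above.

```c
/* Hypothetical sketch of the datapath selection inside the MUX IM or
 * NetKVM/MUX IM driver; names and types are placeholders, not NDIS APIs. */
#include <stdbool.h>
#include <stdatomic.h>

struct mux_context {
    atomic_bool vf_present;        /* true while an SR-IOV VF is attached */
};

/* Stand-ins for the two lower bindings (bodies elided for illustration). */
void mux_send_vf(struct mux_context *mux, void *pkt)     { (void)mux; (void)pkt; }
void mux_send_netkvm(struct mux_context *mux, void *pkt) { (void)mux; (void)pkt; }

/* Invoked when the VF device is hot-unplugged (e.g., just before live
 * migration starts): fall back to the emulated VirtIO-Net datapath.      */
void mux_on_vf_removed(struct mux_context *mux)
{
    atomic_store(&mux->vf_present, false);
}

/* Invoked when a VF is hot-added on the destination host after the
 * migration completes: return to the VF datapath.                        */
void mux_on_vf_arrived(struct mux_context *mux)
{
    atomic_store(&mux->vf_present, true);
}

/* Transmit entry point seen by the upper-layer system protocol driver;
 * the selection below is invisible to it and to the guest OS.            */
void mux_send(struct mux_context *mux, void *pkt)
{
    if (atomic_load(&mux->vf_present))
        mux_send_vf(mux, pkt);         /* pass-through SR-IOV VF datapath */
    else
        mux_send_netkvm(mux, pkt);     /* emulated VirtIO-Net datapath    */
}
```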
As shown in a block 502 of flowchart 500 and
As shown in flowchart 500, in response to a migration event 503 the state of source VM 400 is switched to a pre-migration state under which the datapath on source host 402 is switched from VF datapath 418 to emulated datapath 420, as depicted by a block 504 and in
After the snapshot is taken, live migration may commence. Asynchronously, or otherwise as a part of the pre-migration operations, preparations are made on the destination host to launch (deploy) the destination VM, as depicted by a block 508. The initial launch state is shown in
As shown in a block 510, during the live migration the Hypervisor on the source host tracks dirtied memory pages—that is, memory pages allocated to the VM being migrated that are written to during the live migration. Techniques for tracking dirty memory pages using Hypervisors are known in the art, and the particular scheme that is used is outside the scope of this disclosure. The net result is that at the end of the live migration the Hypervisor will have a list of memory pages that have been dirtied during the migration, where the list represents a delta between the memory state of source VM 400 when the snapshot was taken and the memory state at the end of the migration.
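As one example of such a known mechanism (included here only as a hedged illustration; the embodiments are not limited to it), a Linux/KVM hypervisor can retrieve a per-memory-slot dirty-page bitmap via the KVM_GET_DIRTY_LOG ioctl, provided the memory slot was registered with dirty logging enabled. The slot identifier and page count below are assumptions supplied by the caller; in practice the VMM (e.g., QEMU) drives this as part of its migration machinery.

```c
/* Hedged illustration: fetch the dirty-page bitmap for one KVM memory slot.
 * Assumes the slot was created with KVM_MEM_LOG_DIRTY_PAGES set and that
 * the caller knows the slot id and its size in guest pages.              */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Returns a caller-owned bitmap (one bit per guest page, set if the page
 * was written since the previous call), or NULL on error.                */
uint8_t *get_dirty_bitmap(int vm_fd, uint32_t slot, size_t num_pages)
{
    size_t bitmap_bytes = (num_pages + 7) / 8;
    uint8_t *bitmap = calloc(1, bitmap_bytes);
    if (!bitmap)
        return NULL;

    struct kvm_dirty_log log;
    memset(&log, 0, sizeof(log));
    log.slot = slot;
    log.dirty_bitmap = bitmap;

    /* Pages flagged here are the delta that must be re-copied to the
     * destination host before the source and destination states match.   */
    if (ioctl(vm_fd, KVM_GET_DIRTY_LOG, &log) < 0) {
        free(bitmap);
        return NULL;
    }
    return bitmap;
}
```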
At block 512, the state of source VM 400 is restored on destination host 404 using the snapshot plus a copy of the dirtied memory pages (along with other state information), such that the state of the destination VM matches the state of source VM 400 (i.e., the states are synched). Generally, various techniques that are known in the art may be used for restoring the state of a source VM at a destination VM, with the particular scheme being outside the scope of this disclosure.
Once the states of the source and destination VMs are synched, the live migration is completed and the logic advances to a block 514 wherein operations (for Windows OS 117 and applications 124) are resumed using the destination VM. The source VM is also suspended. This is referred to as the initial post migration state, which is depicted in
Following the initial post migration state is an ongoing runtime operations state after migration. As depicted in
A similar sequence to that shown in
Platform hardware 602 includes a processor 606 having a System on a Chip (SoC) architecture including a central processing unit (CPU) 608 with M processor cores 610, each coupled to a Level 1 and Level 2 (L1/L2) cache 612. Each of the processor cores and L1/L2 caches are connected to an interconnect 614 to which each of a memory interface 616 and a Last Level Cache (LLC) 618 is coupled, forming a coherent memory domain. Memory interface 616 is used to access host memory 604 in which various software components are loaded and run via execution of associated software instructions on processor cores 610.
Processor 606 further includes an IOMMU (input-output memory management unit) 619 and an IO interconnect hierarchy, which includes one or more levels of interconnect circuitry and interfaces that are collectively depicted as IO interconnect & interfaces 620 for simplicity. In one embodiment, the IO interconnect hierarchy includes a PCIe root controller and one or more PCIe root ports having PCIe interfaces. Various components and peripheral devices are coupled to processor 606 via respective interfaces (not all separately shown), including a network device comprising a NIC 621 via an IO interface 623, a firmware storage device 622 in which firmware 624 is stored, and a disk drive or solid state disk (SSD) with controller 626 in which software components 628 are stored. Optionally, all or a portion of the software components used to implement the software aspects of embodiments herein may be loaded over a network (not shown) accessed, e.g., by NIC 621. In one embodiment, firmware 624 comprises a BIOS (Basic Input Output System) portion and additional firmware components configured in accordance with the Universal Extensible Firmware Interface (UEFI) architecture.
During platform initialization, various portions of firmware 624 (not separately shown) are loaded into host memory 604, along with various software components. In addition to host operating system 606, the software components include software components shown in architecture 200 of FIG. 2.
NIC 621 includes one or more network ports 630, with each network port having an associated receive (RX) queue 632 and transmit (TX) queue 634. NIC 621 includes circuitry for implementing various functionality supported by the NIC, including support for operating the NIC as an SR-IOV PCIe endpoint similar to NIC 108 in
Generally, CPU 608 in SoC 606 may employ any suitable processor architecture in current use or developed in the future. In one embodiment, the processor architecture is an Intel® architecture (IA), including but not limited to an Intel® x86 architecture, an IA-32 architecture, and an IA-64 architecture. In one embodiment, the processor architecture is an ARM®-based architecture.
Software components 627 may further include software components associated with an image of a Microsoft Windows OS, such as Windows 10, a version of Windows Server, etc. Alternatively, the Windows OS image may be stored on a storage device that is accessed over a network and loaded into a portion of host memory 604 allocated to the VM on which the Windows instance will be run. Following boot-up and instantiation of the Linux host OS and Hypervisor and one or more VMs, one or more instances of the Windows OS may be instantiated on a respective VM.
During instantiation of the Windows OS instance, the VF and emulated datapaths will be configured via execution of applicable instructions in the Windows OS and Linux OS. During states related to live migration (e.g., pre-migration, live migration, initial post migration etc.) the Hypervisor will selectively activate the VF or emulated datapath by controlling the state of the SR-IOV VF attached to the VM on which the Windows OS instance is running. It is noted that while migration of a single VM is described and illustrated above, similar operations may be implemented to live migrate multiple VMs.
In addition to implementations using SR-IOV, similar approaches to those described and illustrated herein may be applied to Intel® Scalable IO Virtualization (SIOV). SIOV may be implemented on various IO devices such as network controllers, storage controllers, graphics processing units, and other hardware accelerators. As with SR-IOV, SIOV devices and associated software may be configured to support pass-through DMA data transfers both from the SIOV device to the host and from the host to the SIOV device.
Although some embodiments have been described in reference to particular implementations, other implementations are possible according to some embodiments. Additionally, the arrangement and/or order of elements or other features illustrated in the drawings and/or described herein need not be arranged in the particular way illustrated and described. Many other arrangements are possible according to some embodiments.
In each system shown in a figure, the elements in some cases may each have a same reference number or a different reference number to suggest that the elements represented could be different and/or similar. However, an element may be flexible enough to have different implementations and work with some or all of the systems shown or described herein. The various elements shown in the figures may be the same or different. Which one is referred to as a first element and which is called a second element is arbitrary.
In the description and claims, the terms “coupled” and “connected,” along with their derivatives, may be used. It should be understood that these terms are not intended as synonyms for each other. Rather, in particular embodiments, “connected” may be used to indicate that two or more elements are in direct physical or electrical contact with each other. “Coupled” may mean that two or more elements are in direct physical or electrical contact. However, “coupled” may also mean that two or more elements are not in direct contact with each other, but yet still co-operate or interact with each other. Additionally, “communicatively coupled” or “coupled in communication” means that two or more elements that may or may not be in direct contact with each other, are enabled to communicate with each other. For example, if component A is connected to component B, which in turn is connected to component C, component A may be communicatively coupled to component C using component B as an intermediary component.
An embodiment is an implementation or example of the inventions. Reference in the specification to “an embodiment,” “one embodiment,” “some embodiments,” or “other embodiments” means that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least some embodiments, but not necessarily all embodiments, of the inventions. The various appearances “an embodiment,” “one embodiment,” or “some embodiments” are not necessarily all referring to the same embodiments.
Not all components, features, structures, characteristics, etc. described and illustrated herein need be included in a particular embodiment or embodiments. If the specification states a component, feature, structure, or characteristic “may”, “might”, “can” or “could” be included, for example, that particular component, feature, structure, or characteristic is not required to be included. If the specification or claim refers to “a” or “an” element, that does not mean there is only one of the element. If the specification or claims refer to “an additional” element, that does not preclude there being more than one of the additional element.
Italicized letters, such as ‘M’, ‘N’, etc. in the foregoing detailed description and drawings are used to depict an integer number, and the use of a particular letter is not limited to particular embodiments. Moreover, the same letter may be used in separate claims to represent separate integer numbers, or different letters may be used. In addition, use of a particular letter in the detailed description may or may not match the letter used in a claim that pertains to the same subject matter in the detailed description.
As discussed above, various aspects of the embodiments herein may be facilitated by corresponding software and/or firmware components and applications, such as software and/or firmware executed by an embedded processor or the like. Thus, embodiments of this invention may be used as or to support a software program, software modules, firmware, and/or distributed software executed upon some form of processor, processing core or embedded logic, a virtual machine running on a processor or core, or otherwise implemented or realized upon or within a non-transitory computer-readable or machine-readable storage medium. A non-transitory computer-readable or machine-readable storage medium includes any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer). For example, a non-transitory computer-readable or machine-readable storage medium includes any mechanism that provides (i.e., stores and/or transmits) information in a form accessible by a computer or computing machine (e.g., computing device, electronic system, etc.), such as recordable/non-recordable media (e.g., read only memory (ROM), random access memory (RAM), magnetic disk storage media, optical storage media, flash memory devices, etc.). The content may be directly executable (“object” or “executable” form), source code, or difference code (“delta” or “patch” code). A non-transitory computer-readable or machine-readable storage medium may also include a storage or database from which content can be downloaded. The non-transitory computer-readable or machine-readable storage medium may also include a device or product having content stored thereon at a time of sale or delivery. Thus, delivering a device with stored content, or offering content for download over a communication medium may be understood as providing an article of manufacture comprising a non-transitory computer-readable or machine-readable storage medium with such content described herein.
The operations and functions performed by various components described herein may be implemented by software running on a processing element, via embedded hardware or the like, or any combination of hardware and software. Such components may be implemented as software modules, hardware modules, special-purpose hardware (e.g., application specific hardware, ASICs, DSPs, etc.), embedded controllers, hardwired circuitry, hardware logic, etc. Software content (e.g., data, instructions, configuration information, etc.) may be provided via an article of manufacture including non-transitory computer-readable or machine-readable storage medium, which provides content that represents instructions that can be executed. The content may result in a computer performing various functions/operations described herein.
As used herein, a list of items joined by the term “at least one of” can mean any combination of the listed terms. For example, the phrase “at least one of A, B or C” can mean A; B; C; A and B; A and C; B and C; or A, B and C.
The above description of illustrated embodiments of the invention, including what is described in the Abstract, is not intended to be exhaustive or to limit the invention to the precise forms disclosed. While specific embodiments of, and examples for, the invention are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the invention, as those skilled in the relevant art will recognize.
These modifications can be made to the invention in light of the above detailed description. The terms used in the following claims should not be construed to limit the invention to the specific embodiments disclosed in the specification and the drawings. Rather, the scope of the invention is to be determined entirely by the following claims, which are to be construed in accordance with established doctrines of claim interpretation.