Benefit is claimed under 35 U.S.C. 119(a)-(d) to Foreign Application Serial No. 202141031602 filed in India entitled “MIGRATION OF VIRTUAL COMPUTE INSTANCES USING REMOTE DIRECT MEMORY ACCESS”, on Jul. 14, 2021, by VMware, Inc., which is herein incorporated in its entirety by reference for all purposes.
The ability to migrate running instances of virtual machines (VMs) between host computers is a fundamental advantage of virtual machines over physical machines. Various advancements have been achieved in VM migration technology including live migration, which is described in U.S. Pat. No. 7,484,208. In addition, different forms of VM migration have been practiced. For example, in U.S. Pat. No. 6,795,966, a high availability virtual machine cluster is provided in which a virtual machine is transitioned from one host computer to another host computer using a shared storage system that maintains a representation of the virtual machine state.
The technology described in U.S. Pat. No. 6,795,966 is employed in situations where a host computer has failed and protected VMs running in the failed host computer are recovered in another host. However, failures are often abrupt and result in data loss because there is not sufficient time for the host computers to update the representation of the virtual machine state to the most current state. Consequently, the recovered VMs are restored to a state earlier than the current state, e.g., the most recent checkpointed state.
Embodiments provide an improved technique for migrating VMs (more generally referred to as virtual compute instances) between host computers. This technique employs remote direct memory access (RDMA) to transfer the entire state of a VM residing in system memory of a source host computer to system memory of a destination host computer. Because the technique employs RDMA, the state of the VM in system memory may be transferred even after failure of system software running in the source host computer. As a result, the VM may be recovered on the destination host computer without any data loss even when the system software running in the source host computer crashes.
In the embodiments described below, migration of VMs is described in the context of failover in a high availability virtual machine cluster, where protected VMs running in a failed host computer are recovered in a failover host computer. In such an example, the source host computer is the failed host computer and the destination host computer is the failover host computer, and migration is carried out by suspending the VM in the source host computer and resuming it in the destination host computer. However, embodiments may be practiced in other situations, e.g., in non-high-availability contexts where both the source host computer and the destination host computer are operational.
In the embodiments, NICs 108 include functionality to support RDMA transport protocols, e.g., RDMA over Converged Ethernet (RoCE) and Wide Area RDMA Protocol (iWARP), in addition to other transport protocols, such as TCP. Such RDMA-enabled NICs are commercially available from hardware vendors, such as Mellanox Technologies, Inc. and Chelsio Communications.
A virtualization software layer, referred to hereinafter as hypervisor 111, is installed on top of hardware platform 102. Hypervisor 111 makes possible the concurrent instantiation and execution of one or more VMs 118_1-118_N. The interaction of a VM 118 with hypervisor 111 is facilitated by the virtual machine monitors (VMMs) 134. Each VMM 134_1-134_N is assigned to and monitors a corresponding VM 118_1-118_N. In one embodiment, hypervisor 111 may be a hypervisor implemented as a commercial product in VMware's vSphere® virtualization product, available from VMware, Inc. of Palo Alto, Calif. In an alternative embodiment, hypervisor 111 runs on top of a host operating system which itself runs on hardware platform 102. In such an embodiment, hypervisor 111 operates above an abstraction level provided by the host operating system.
After instantiation, each VM 118_1-118_N encapsulates a virtual hardware platform that is executed under the control of hypervisor 111, in particular the corresponding VMM 134_1-134_N. For example, virtual hardware devices of VM 118_1 in virtual hardware platform 120 include one or more virtual CPUs (vCPUs) 122_1-122_N, a virtual random access memory (vRAM) 124, a virtual network interface adapter (vNIC) 126, and virtual HBA (vHBA) 128. Virtual hardware platform 120 supports the installation of a guest operating system (guest OS) 130, on top of which applications 132 are executed in VM 118_1. Examples of guest OS 130 include any of the well-known commodity operating systems, such as the Microsoft Windows® operating system, the Linux® operating system, and the like.
It should be recognized that the various terms, layers, and categorizations used to describe the components in
In the embodiments, a plurality of host computers (also referred to simply as “hosts”), each configured in the manner illustrated for computer system 100, is managed as a cluster by a VM management server 210 to provide cluster-level functions, such as load balancing across the cluster by performing VM migration between the hosts, distributed power management, dynamic VM placement according to affinity and anti-affinity rules, and high availability (HA). VM management server 210 also manages shared storage 220 to provision storage resources for the cluster.
Failed host 201 represents a host that has failed, e.g., as a result of system software (e.g., hypervisor 111) crash. Failover host 202 represents a host in which protected VMs (which are VMs designated for high availability and depicted in
In the embodiments, RDMA-enabled NICs transfer data directly between system memory of hosts without involving the system software of either host. In general, RDMA implementations provide several communication primitives (so-called “verbs”) that can be categorized into two classes: (1) one-sided verbs and (2) two-sided verbs. One-sided RDMA verbs (READ/WRITE) provide remote memory access semantics, in which the host (which is the failover host in the embodiments) specifies the memory address of the remote node (which is the failed host in the embodiments) that should be accessed. When using one-sided verbs, the CPU of the remote node is not actively involved in the data transfer. Two-sided verbs (SEND/RECEIVE) provide channel semantics. In order to transfer data between a host and a remote node, the remote node first needs to publish a RECEIVE request before the host can transfer the data with a SEND operation. In contrast to one-sided verbs, the host does not specify the target remote memory address. Instead, the remote node defines the target address in its RECEIVE operation. Consequently, by posting the RECEIVE, the remote CPU is actively involved in the data transfer.
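The difference between the two classes of verbs can be illustrated with a small in-process model. The following is only a sketch in Python and does not use any real RDMA library: the `Node` class, the `cpu_ops` counter, and the `rdma_read`, `rdma_send`, and `post_receive` names are hypothetical stand-ins chosen for illustration. It shows that a one-sided READ succeeds without any work by the remote node, while a two-sided SEND depends on the remote node having posted a RECEIVE.

```python
class Node:
    """Toy stand-in for a host: flat byte-addressable memory plus a CPU-work counter."""

    def __init__(self, size):
        self.memory = bytearray(size)
        self.cpu_ops = 0            # counts work done by this node's own CPU
        self.posted_receives = []   # target addresses published via RECEIVE

    def post_receive(self, target_addr):
        # Two-sided: the remote CPU actively publishes where incoming data goes.
        self.cpu_ops += 1
        self.posted_receives.append(target_addr)


def rdma_read(remote, remote_addr, length):
    """One-sided READ: the initiator names the remote address; the remote CPU is idle."""
    return bytes(remote.memory[remote_addr:remote_addr + length])


def rdma_send(remote, payload):
    """Two-sided SEND: fails unless the remote node has already posted a RECEIVE."""
    if not remote.posted_receives:
        raise RuntimeError("no RECEIVE posted on remote node")
    addr = remote.posted_receives.pop(0)   # the remote node chose the target address
    remote.memory[addr:addr + len(payload)] = payload
    return addr


# A "failed" node whose CPU does nothing can still be read with a one-sided READ.
failed = Node(64)
failed.memory[8:13] = b"state"
data = rdma_read(failed, 8, 5)
```

In this model, a node whose software has crashed can never call `post_receive`, so only the one-sided path remains usable, which is consistent with the choice of one-sided verbs described in the embodiments.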
Embodiments employ one-sided RDMA verbs, in particular one-sided RDMA READ, hereinafter referred to as a single-sided RDMA operation. To do so, a memory transfer region is configured in each host when the host is booted up. This memory transfer region has a fixed virtual address space, such that the mapping between the virtual addresses and the physical addresses in this memory transfer region is fixed. When VMs are powered on (i.e., instantiated), hypervisor 111 creates an in-memory file system for each of the VMs in this memory transfer region, and communicates with other hosts in the cluster to create RDMA queue pairs. An RDMA queue pair includes a send queue and a receive queue. The send queue includes a pointer to a memory region from which data are sent and the receive queue includes a pointer to a memory region into which data will be received. For example, when a VM is instantiated in a host, a pointer to the in-memory file system that the hypervisor created for the VM and from which data will be sent will be placed in the send queue, and in each of the other hosts in the cluster, a pointer to the memory region for receiving the data will be placed in the receive queue. Accordingly, multiple queue pairs are created in the cluster each time a VM is instantiated.
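The bookkeeping described above, in which one queue pair is created per peer host each time a VM is instantiated, can be sketched as follows. This is a simplified Python model under stated assumptions, not hypervisor code: the `Host` and `QueuePair` classes, the `power_on_vm` function, the region labels, and the 4096-byte region size are all hypothetical, and string labels stand in for the memory pointers held in the send and receive queues.

```python
from dataclasses import dataclass, field


@dataclass
class QueuePair:
    send_region: str      # label standing in for a pointer to the region data are sent from
    receive_region: str   # label standing in for a pointer to the region data are received into


@dataclass
class Host:
    name: str
    regions: dict = field(default_factory=dict)   # region label -> bytearray


def power_on_vm(vm, source, cluster, region_size=4096):
    """Create the VM's in-memory file system on `source` and one queue pair per peer host."""
    src_label = f"{source.name}:{vm}:fs"
    source.regions[src_label] = bytearray(region_size)
    pairs = []
    for peer in cluster:
        if peer is source:
            continue
        dst_label = f"{peer.name}:{vm}:rx"
        peer.regions[dst_label] = bytearray(region_size)
        pairs.append(QueuePair(send_region=src_label, receive_region=dst_label))
    return pairs


hosts = [Host("host-a"), Host("host-b"), Host("host-c")]
pairs = power_on_vm("VM1", hosts[0], hosts)   # two peers, so two queue pairs
```

Because the receiving regions are set up when the VM is powered on, any peer can later act as the failover host without further cooperation from the source host.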
In
When host 201 fails (e.g., as a result of a crash of hypervisor 111), host 201 executes panic code to suspend the protected VMs of host 201, e.g., VM1 and VM2, and copy page tables of the protected VMs into their respective in-memory file systems. The copying of the VM1 page tables into memory region 231 is depicted with an arrow 251 and the copying of the VM2 page tables into memory region 232 is depicted with an arrow 252. After the page tables have been copied into memory regions 231, 232, NIC 108 of host 202, which represents the failover host, performs a single-sided RDMA read operation with reference to the established queue pairs to transfer the contents of memory region 231 into memory region 241 (as depicted by arrow 253) without involving the CPU of host 201 and to transfer the contents of memory region 232 into memory region 242 (as depicted by arrow 254) without involving the CPU of host 201. As a result, the VM1 page tables and the VM2 page tables are now resident in memory regions of host 202.
After the page tables have been copied over, NIC 108 of host 202 performs additional single-sided RDMA read operations to transfer data pages of VM1 and VM2 from their locations in system memory of host 201 to the memory transfer region of host 202 as depicted by arrows 255 and 256. The single-sided RDMA read operations specify the locations of the data pages of VM1 in the system memory of host 201 determined from the VM1 page tables transferred into memory region 241 and the locations of the data pages of VM2 in the system memory of host 201 determined from the VM2 page tables transferred into memory region 242. After all contents of the data pages of VM1 have been transferred into the memory transfer region of host 202, the hypervisor of host 202 copies them into new locations in system memory of host 202 as depicted by arrow 263, and reconstructs the page tables of VM1 to reference the new locations in system memory of host 202 into which the data pages of VM1 have been copied. The reconstructed page tables of VM1 are then written to memory region 261. Similarly, after all contents of the data pages of VM2 have been transferred into the memory transfer region of host 202, the hypervisor of host 202 copies them into new locations in system memory of host 202 as depicted by arrow 264, and reconstructs the page tables of VM2 to reference new locations in system memory of host 202 into which the data pages of VM2 have been copied. The reconstructed page tables of VM2 are then written to memory region 262.
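The page transfer and page table reconstruction described above can be sketched in Python as follows. This is a toy model under stated assumptions: `recover_pages`, `rdma_read`, the 4-byte `PAGE` size, and the dictionary representation of a page table are hypothetical, and the intermediate staging of pages in the memory transfer region is collapsed into a direct copy for brevity.

```python
PAGE = 4  # toy page size in bytes, far smaller than a real page


def rdma_read(remote_mem, addr, length):
    """Stand-in for a single-sided RDMA read of the failed host's system memory."""
    return bytes(remote_mem[addr:addr + length])


def recover_pages(remote_mem, remote_page_table, local_mem, free_frames):
    """Fetch each page named in the transferred page table and relocate it locally.

    remote_page_table maps guest page number -> address in the failed host's memory.
    Returns a reconstructed table mapping guest page number -> local address.
    """
    new_table = {}
    for gpn, remote_addr in sorted(remote_page_table.items()):
        page = rdma_read(remote_mem, remote_addr, PAGE)
        local_addr = free_frames.pop(0) * PAGE      # pick a free local frame
        local_mem[local_addr:local_addr + PAGE] = page
        new_table[gpn] = local_addr                 # entry now references the new location
    return new_table


remote_mem = bytearray(b"....AAAA....BBBB")   # failed host's memory
remote_table = {0: 4, 1: 12}                  # page tables already transferred
local_mem = bytearray(16)                     # failover host's memory
new_table = recover_pages(remote_mem, remote_table, local_mem, free_frames=[2, 0])
```

The guest page numbers are preserved while the addresses they map to change, which is why the page tables must be reconstructed rather than reused as transferred.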
In response to the notification sent by the failed host at step 304, the VM management server, at step 320, selects one of the other hosts of the cluster as a failover host, i.e., the host in which the protected VMs in the failed host are to be recovered. At step 322, the VM management server instructs the failover host to recover the protected VMs and transmits the configuration data of the protected VMs in the failed host to the failover host. The configuration data provides identifying information for the protected VMs and the storage provisioned for the protected VMs in shared storage 220, and also specifies resource requirements for the protected VMs.
Upon receipt of the instruction to recover the protected VMs, the failover host executes steps 340, 342, 344, 346, 348, 350, 352, and 354 for each of the protected VMs. At step 340, the failover host instantiates the protected VM using the configuration data provided by the VM management server. Then, at step 342, the failover host confirms that the protected VM has been suspended (e.g., by performing a single-sided RDMA read operation on the data structure in the system memory of the failed host that tracks the suspended state of the protected VMs). After confirming that the protected VM has been suspended, the failover host at step 344 performs a single-sided RDMA read operation with reference to the established queue pairs to transfer the page tables of the protected VM from the memory transfer region of the failed host to the memory transfer region of the failover host, without involving the CPU of the failed host. After the page tables have been copied over, the failover host at step 346 performs additional single-sided RDMA read operations to transfer data pages of the protected VM from the system memory of the failed host to its memory transfer region and then copies the transferred data pages into free locations in its system memory. After all contents of the data pages of the protected VM have been transferred and copied into new locations in its system memory, the failover host at step 348 reconstructs the page tables of the protected VM to reference the new locations in the system memory thereof into which the data pages of the protected VM have been copied, and at step 350 writes the reconstructed page tables to the system memory thereof. Then, at step 352, the failover host notifies the failed host that the protected VM has been recovered. The process on the failover host side ends when all protected VMs have been recovered (step 354; Yes).
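The per-VM sequence of steps 340 through 354 can be summarized as a loop. The following Python sketch is illustrative only: the `recover_protected_vms` function, the dictionary layout of `failed_host`, and the `log` of step numbers are hypothetical, and plain dictionary lookups stand in for the single-sided RDMA read operations.

```python
def recover_protected_vms(vm_configs, failed_host):
    """Run the per-VM recovery sequence; `failed_host` is a dict standing in for remote memory."""
    log = []          # (step number, VM name) in execution order
    recovered = {}
    for cfg in vm_configs:
        name = cfg["name"]
        log.append((340, name))                        # instantiate from configuration data
        if not failed_host["suspended"].get(name):     # step 342: confirm suspension
            raise RuntimeError(f"{name} is not suspended yet")
        log.append((342, name))
        page_table = dict(failed_host["page_tables"][name])      # step 344: fetch page tables
        log.append((344, name))
        pages = {gpn: failed_host["memory"][addr]                # step 346: fetch data pages
                 for gpn, addr in page_table.items()}
        log.append((346, name))
        order = sorted(pages)
        new_table = {gpn: slot for slot, gpn in enumerate(order)}  # step 348: reconstruct
        log.append((348, name))
        recovered[name] = {"page_table": new_table,              # step 350: write page tables
                           "pages": [pages[gpn] for gpn in order]}
        log.append((350, name))
        failed_host["recovered"].add(name)                       # step 352: notify failed host
        log.append((352, name))
    return recovered, log    # leaving the loop corresponds to step 354; Yes


failed_host = {
    "suspended": {"VM1": True},
    "page_tables": {"VM1": {0: 100, 1: 200}},
    "memory": {100: b"a", 200: b"b"},
    "recovered": set(),
}
recovered, log = recover_protected_vms([{"name": "VM1"}], failed_host)
```

In this sketch an unsuspended VM raises an error; a real implementation would more plausibly poll or retry at step 342, but the source does not specify that behavior.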
Certain embodiments as described above involve a hardware abstraction layer on top of a host computer. The hardware abstraction layer allows multiple contexts to share the hardware resource. In one embodiment, these contexts are isolated from each other, each having at least a user application running therein. The hardware abstraction layer thus provides benefits of resource isolation and allocation among the contexts. In the foregoing embodiments, virtual machines are used as an example for the contexts and hypervisors as an example for the hardware abstraction layer. As described above, each virtual machine includes a guest operating system in which at least one application runs. It should be noted that these embodiments may also apply to other examples of contexts, such as containers not including a guest operating system, referred to herein as “OS-less containers” (see, e.g., www.docker.com). OS-less containers implement operating system-level virtualization, wherein an abstraction layer is provided on top of the kernel of an operating system on a host computer. The abstraction layer supports multiple OS-less containers, each including an application and its dependencies. Each OS-less container runs as an isolated process in user space on the host operating system and shares the kernel with other containers. The OS-less container relies on the kernel's functionality to make use of resource isolation (CPU, memory, block I/O, network, etc.) and separate namespaces and to completely isolate the application's view of the operating environments. By using OS-less containers, resources can be isolated, services restricted, and processes provisioned to have a private view of the operating system with their own process ID space, file system structure, and network interfaces. Multiple containers can share the same kernel, but each container can be constrained only to use a defined amount of resources such as CPU, memory, and I/O.
Certain embodiments may be implemented in a host computer without a hardware abstraction layer or an OS-less container. For example, certain embodiments may be implemented in a host computer running a Linux® or Windows® operating system.
The various embodiments described herein may be practiced with other computer system configurations, including hand-held devices, microprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and the like.
One or more embodiments of the present invention may be implemented as one or more computer programs or as one or more computer program modules embodied in one or more computer-readable media. The term computer-readable medium refers to any data storage device that can store data which can thereafter be input to a computer system. Computer-readable media may be based on any existing or subsequently developed technology for embodying computer programs in a manner that enables them to be read by a computer. Examples of a computer-readable medium include a hard drive, network-attached storage (NAS), read-only memory, random-access memory (e.g., a flash memory device), a CD (Compact Disc) such as a CD-ROM, a CD-R, or a CD-RW, a DVD (Digital Versatile Disc), a magnetic tape, and other optical and non-optical data storage devices. The computer-readable medium can also be distributed over a network-coupled computer system so that the computer-readable code is stored and executed in a distributed fashion.
Although one or more embodiments of the present invention have been described in some detail for clarity of understanding, it will be apparent that certain changes and modifications may be made within the scope of the claims. Accordingly, the described embodiments are to be considered as illustrative and not restrictive, and the scope of the claims is not to be limited to details given herein but may be modified within the scope and equivalents of the claims. In the claims, elements and/or steps do not imply any particular order of operation unless explicitly stated in the claims.
Plural instances may be provided for components, operations, or structures described herein as a single instance. Finally, boundaries between various components, operations, and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of the invention(s). In general, structures and functionality presented as separate components in exemplary configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements may fall within the scope of the appended claim(s).
| Number | Date | Country | Kind |
|---|---|---|---|
| 202141031602 | Jul 2021 | IN | national |