Distributed systems allow multiple clients in a network to access a pool of shared resources. For example, a distributed storage system, such as a distributed virtual storage area network (VSAN) datastore, allows a cluster of host computers (i.e., physical machines such as servers) to aggregate local physical disks (e.g., SSD, PCI-based flash storage, SATA, or SAS magnetic disks) located in or attached to each host computer to create a single and shared pool of storage. This pool of storage (sometimes referred to herein as a “datastore” or “store”) is accessible by all host computers in the cluster and may be presented as a single namespace of storage entities (such as a hierarchical file system namespace in the case of files, a flat namespace of unique identifiers in the case of objects, etc.). Storage clients operating on the host computers may use the datastore to store objects (e.g., virtual disks) that are accessed by virtual computing instances (VCIs), such as virtual machines (VMs), during their operations. An object is any data, structured or unstructured, and is a unit of storage of the datastore.
Each object stored in the datastore may include one or more components. A component may be a part of, or portion of, an object. In some cases, such as depending on the storage policy that is defined (e.g., by an administrator) for the object, mirrors (i.e., copies) of components of the object are stored on different hosts in the cluster. For example, a copy of a first component may be stored on both a first host and a second host of the cluster. Accordingly, should either of the first host or the second host become unavailable or inoperable, the component of the object is still available on the other of the first host or the second host, providing high availability of the object. In another example, if a storage policy for an object requires higher performance, different components of the object may be stored on different disks or hosts, such that I/O operations on the different components can occur in parallel on the different disks or hosts. For example, a first component of the object may be stored on a first disk and a second component of the object may be stored on a second disk. The different components of an object may also be referred to as “object components.”
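For illustration only, the following is a minimal sketch, using hypothetical names such as StoragePolicy, failures_to_tolerate, and stripe_width, of how a storage policy of the kind described above might drive placement of mirrored object components onto different hosts. It is not the datastore's actual placement logic.

```python
# Illustrative sketch only: a simplified storage policy and a naive round-robin
# placement routine. The field names and the placement strategy are assumptions
# for illustration, not the actual datastore implementation.
from dataclasses import dataclass
from itertools import cycle

@dataclass
class StoragePolicy:
    failures_to_tolerate: int  # number of mirror copies beyond the original
    stripe_width: int          # number of components an object is split into

def place_components(object_id: str, policy: StoragePolicy, hosts: list[str]) -> dict:
    """Return a hypothetical mapping of object components to hosts."""
    placement = {}
    host_cycle = cycle(hosts)  # assumes enough hosts so mirror copies land on distinct hosts
    for stripe in range(policy.stripe_width):
        component = f"{object_id}-c{stripe}"
        # One placement per copy: the original plus one copy per tolerated failure.
        placement[component] = [next(host_cycle) for _ in range(policy.failures_to_tolerate + 1)]
    return placement

print(place_components("vm1-vmdk", StoragePolicy(failures_to_tolerate=1, stripe_width=2),
                       ["host-a", "host-b", "host-c", "host-d"]))
```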
In some cases, hosts of a cluster are distributed across multiple (e.g., at least two) fault domains (e.g., different physical sites, different portions of a network, etc.), and such a cluster is referred to as a “stretched cluster.” A fault domain is a set of hosts that share a single point of failure, such that if one host in the fault domain fails, other hosts in the same fault domain are highly likely to fail as well, while hosts in other fault domains are unlikely to be affected. A fault domain, therefore, may be defined as hosts that share the same physical location, same physical rack, same physical network, and/or other point of failure. In some cases, a fault domain may be a site, which may be a data center.
In certain cases, components of storage objects are replicated across multiple fault domains, such as to provide high availability. Such components replicated across multiple fault domains may be referred to as stretched components, and the corresponding objects may be referred to as stretched objects. For example, one or more components of a storage object may be stored on one or more hosts of a first fault domain as well as one or more hosts of a second fault domain. The stretched storage object may be associated with a VCI, such as where the stretched storage object is a virtual disk of a VCI. In some cases, a VCI accesses the copy of its storage object that resides in the fault domain where the VCI is currently running (the copy of the storage object in the same fault domain being referred to as “a site-local copy”).
To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures. It is contemplated that elements disclosed in one embodiment may be beneficially utilized in other embodiments without specific recitation.
The present disclosure provides improved systems and methods for failure behavior for stretched clusters.
Each VCI running in a stretched cluster may be assigned a preferred fault domain (and optionally a secondary fault domain). Accordingly, in certain cases, when there is a network connection failure, the VCI is restarted on a host in the preferred fault domain of the VCI. In certain aspects, the VCI itself is directly assigned a preferred fault domain (and optionally a secondary fault domain). In certain aspects, each object, or even each component of each object, of the VCI stored in the stretched cluster is assigned a preferred fault domain (and optionally a secondary fault domain), and by association the VCI is assigned a preferred fault domain (and optionally a secondary fault domain). In certain aspects, each object or each component of each object of the VCI is assigned the same preferred fault domain (and optionally secondary fault domain).
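As a non-limiting illustration of the assignment just described, the following sketch uses hypothetical names (FaultDomainAssignment, assign_components) to show a preferred fault domain, and an optional secondary fault domain, recorded for a VCI and carried down to each of its object components.

```python
# A minimal sketch, using hypothetical names, of recording a preferred (and
# optional secondary) fault domain for a VCI and propagating the same
# assignment to each of the VCI's object components.
from dataclasses import dataclass
from typing import Optional

@dataclass
class FaultDomainAssignment:
    preferred: str                   # e.g., "fault-domain-1"
    secondary: Optional[str] = None  # e.g., "fault-domain-2"

def assign_components(vci_assignment: FaultDomainAssignment,
                      component_ids: list[str]) -> dict[str, FaultDomainAssignment]:
    """Give every object component of the VCI the same assignment as the VCI."""
    return {component_id: vci_assignment for component_id in component_ids}

assignments = assign_components(FaultDomainAssignment("fault-domain-1", "fault-domain-2"),
                                ["vmdk-c0", "vmdk-c1"])
print(assignments)
```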
For example, assume a stretched cluster includes a first fault domain including a first set of hosts, and a second fault domain including a second set of hosts. Further, the first fault domain has a network connection to the second fault domain, such that the first set of hosts are connected to the second set of hosts. Further, assume that a first VCI is running on a host of the second set of hosts in the second fault domain. The preferred fault domain of the VCI is set as the first fault domain. The VCI may not be running in its preferred fault domain for a variety of reasons, such as load balancing.
In certain cases, when there is a network connection failure between the first fault domain and the second fault domain, the VCI is restarted in the preferred fault domain, meaning the VCI is restarted on a host of the first set of hosts in the first fault domain, instead of continuing to run in the second fault domain without restarting. Such restarting of the VCI may cause outage time for the VCI, which is not desirable. Such restarting may occur even when both the first set of hosts of the first fault domain and the second set of hosts of the second fault domain are able to communicate with a witness node (e.g., a physical machine or VCI) running in a third fault domain. As both the first set of hosts and the second set of hosts are able to communicate with the witness node, it is known in the system that both fault domains are running properly, and therefore it is not absolutely necessary for the VCI to be restarted in the first fault domain instead of continuing to run in the second fault domain. In particular, to manage fault domains in a stretched cluster, a witness node is established in the third fault domain, which may be referred to as a witness fault domain. The witness node facilitates failover decisions in the event of a fault domain failure or inter-fault domain network failure, as further discussed herein.
Accordingly, certain aspects herein provide techniques for automatically setting the preferred fault domain of a VCI to be the fault domain in which the VCI is currently running. Such techniques provide fault tolerance while reducing the number of circumstances in which a VCI is restarted, thereby reducing VCI outage time. In certain aspects, when a VCI is created, a witness node automatically assigns the preferred fault domain of the VCI to be the fault domain in which the VCI is operating. An administrator may specify a secondary fault domain. In some examples, when the VCI is restarted in a different fault domain (e.g., the secondary fault domain), the system automatically reassigns the different fault domain to be the preferred fault domain.
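The following is a hedged sketch of the registration behavior described above, under the assumption of a hypothetical WitnessRegistry interface; the method names and bookkeeping are assumptions for illustration, not an actual witness node API.

```python
# Illustrative sketch only (hypothetical API): the witness node registers the
# fault domain in which a VCI is created as its preferred fault domain, and
# re-registers when the VCI later restarts in a different fault domain.
from typing import Optional

class WitnessRegistry:
    def __init__(self):
        self._preferred = {}   # vci_id -> fault domain name
        self._secondary = {}   # vci_id -> optional fault domain name

    def on_vci_created(self, vci_id: str, current_fault_domain: str,
                       secondary: Optional[str] = None) -> None:
        # The fault domain where the VCI is created becomes its preferred fault domain.
        self._preferred[vci_id] = current_fault_domain
        if secondary is not None:
            self._secondary[vci_id] = secondary  # e.g., specified by an administrator

    def on_vci_restarted(self, vci_id: str, new_fault_domain: str) -> None:
        # The fault domain the VCI now runs in becomes the preferred fault domain;
        # the prior preferred fault domain may become the secondary fault domain.
        prior = self._preferred.get(vci_id)
        self._preferred[vci_id] = new_fault_domain
        if prior is not None and prior != new_fault_domain:
            self._secondary[vci_id] = prior
```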
In certain aspects, each fault domain provides a heartbeat message to the witness node. The heartbeat enables the witness node to directly or indirectly provide the status of each fault domain when a decision to restart VCIs is necessary. This reduces the number of times a VCI may need to be restarted. For example, because the fault domain in which the VCI is operating is its preferred fault domain, inter-fault domain network failures alone do not necessitate the VCI being restarted. Certain aspects are further discussed herein with respect to a site as a fault domain for illustrative purposes. However, it should be understood that this is merely an example, and the techniques herein may be used with any suitable fault domain.
In the illustrated example, each node 111 includes a storage management module (referred to herein as a “VSAN module”) in order to automate storage management workflows (e.g., create objects in the object store, etc.) and provide access to objects in the object store (e.g., handle I/O operations on objects in the object store, etc.) based on predefined storage policies specified for objects in the object store. For example, a VM may be initially configured by an administrator to have specific storage requirements (or policy) for its “virtual disk” depending on its intended use (e.g., capacity, availability, performance or input/output operations per second (IOPS), etc.); accordingly, the administrator may define a storage profile or policy for each VM specifying such availability, capacity, performance, and the like. The VSAN module may then create an “object” for the specified virtual disk by backing it with physical storage resources of the object store based on the defined storage policy.
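By way of illustration only, the following sketch assumes hypothetical names (StorageProfile, create_virtual_disk_object) to show how a per-VM storage profile might be expressed and used to back a virtual disk with an object; it is not the actual VSAN module interface.

```python
# A hedged sketch, with assumed field and function names, of defining a per-VM
# storage profile and deriving a hypothetical object descriptor from it.
from dataclasses import dataclass

@dataclass
class StorageProfile:
    capacity_gb: int
    availability: int   # e.g., number of host failures to tolerate
    iops_limit: int

def create_virtual_disk_object(vm_name: str, profile: StorageProfile) -> dict:
    """Return a hypothetical object descriptor honoring the profile."""
    return {
        "object_id": f"{vm_name}-vmdk",
        "capacity_gb": profile.capacity_gb,
        "mirrors": profile.availability + 1,  # one copy plus one per tolerated failure
        "iops_limit": profile.iops_limit,
    }

print(create_virtual_disk_object("vm-01",
                                 StorageProfile(capacity_gb=100, availability=1, iops_limit=500)))
```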
A virtualization management platform 105 is associated with cluster 110 of nodes 111. Virtualization management platform 105 enables an administrator to manage the configuration and spawning of the VMs on the various nodes 111. As illustrated in
In some examples, VSAN module 114 may be implemented as a “VSAN” device driver within hypervisor 113. In such an embodiment, VSAN module 114 may provide access to a conceptual “VSAN” 115 through which an administrator can create a number of top-level “device” or namespace objects that are backed by object store 116.
One or more nodes 111 of node cluster 110 may be located at a geographical site that is distinct from the geographical site where one or more other nodes 111 are located, such as in the case of a stretched cluster. For example, some nodes 111 of node cluster 110 may be located at building A while other nodes may be located at building B. In another example, the geographical sites may be more remote such that one geographical site is located in one city or country and the other geographical site is located in another city or country. In such examples, any communications (e.g., I/O operations) between a node at one geographical site and a node at the other remote geographical site may be performed through a network, such as a wide area network (“WAN”). Furthermore, as described below, one or more witness nodes may also be included in cluster 110.
While certain aspects are discussed herein with respect to VCIs for a VSAN storing objects associated with the VCIs, the techniques herein are similarly applicable to other types of VCIs associated with other objects, such as iSCSI, container volumes, etc.
In the illustrated example, the cluster 200 includes two sites 206 and 208 and a witness node 210. In the illustrated example, two nodes 202a and 202b are located at the first site 206, two nodes 202c and 202d are located at the second site 208, and the witness node 210 is located at a third site 212. Storage objects (e.g., components of objects) in the cluster 200 are mirrored between the two sites 206 and 208. The sites 206, 208 and 212 may be different physical racks or may be data centers located geographically remote from each other. The sites 206, 208 and 212 are connected via inter-site network links 214, 216, and 218 through, for example, a wide area network (WAN). In the illustrated example, (i) the first site 206 is communicatively coupled to the second site 208 via a first inter-site network link 214, (ii) the first site 206 is communicatively coupled to the third site 212 via a second inter-site network link 216, and (iii) the second site 208 is communicatively coupled to the third site 212 via a third inter-site network link 218.
The witness node 210 represents a computing entity, such as a VCI or physical machine, that provides external observation of the first site 206 and the second site 208 to facilitate determining which sites are available in the event that the sites 206 and 208 become partitioned from one another (e.g., the first inter-site network link 214 fails, etc.). In some examples, the sites 206 and 208 (e.g., a node 202 in each of the sites) and the witness node 210 vote to determine the status of the sites 206 and 208 based on, for example, communicating with each other via the inter-site network links 214, 216, and 218. The majority of the votes determines how the sites 206 and 208 react to a failure. For example, when the sites 206 and 208 are operating and the inter-site network links 214, 216, and 218 are operating, the first site 206 will vote that the first site 206 is operable, the second site 208 will vote that the first site 206 is operable, and the witness node 210 will vote that the first site is operable.
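The following is a minimal sketch of the majority vote described above, using assumed names; each voter (the two data sites and the witness node) reports whether it observes a given site as operable, and the site is treated as operable if a majority agrees.

```python
# Illustrative majority-vote helper; voter names and the example scenario are
# assumptions for illustration.
def site_is_operable(votes: dict[str, bool]) -> bool:
    """votes maps voter name -> True if that voter observes the site as operable."""
    yes = sum(1 for observed_operable in votes.values() if observed_operable)
    return yes > len(votes) / 2

# Example: the inter-site link 214 fails, but both sites still reach the witness node.
votes_for_site_206 = {"site-206": True, "site-208": False, "witness-210": True}
print(site_is_operable(votes_for_site_206))  # True: 2 of 3 voters see site 206 as operable
```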
The witness node 210 stores cross-site witness components, which are meta-data components that are dynamically added for objects (or components of objects) with at least dual site mirroring policies (e.g., objects that are mirrored on a stretched cluster across at least two sites) and are used to facilitate determining object availability. The witness component associated with a given component or object may indicate a preferred fault domain (and optionally a secondary fault domain) for the component, and therefore indicate a preferred fault domain (and optionally a secondary fault domain) for a VCI to which the component belongs. In the event of a failure, each site can use its connection to witness node 210 to determine whether to restart VCIs at that site. In some cases, witness node 210 may be a witness appliance. In either case, the witness node 210 provides the ability to distinguish between a failure of the first inter-site network link 214 and a failure of one of the sites 206 and 208.
To facilitate detecting a failure of one of the sites 206 and 208, each of the sites 206 and 208 (e.g., a node 202 in each of the sites) sends a heartbeat message to the witness node 210. The heartbeat message may be a packet with a defined payload indicating it is a heartbeat message. For example, each of the sites 206 and 208 sends a heartbeat message to the witness node 210 once every second. When a site fails, or when the inter-site network link between the site and the witness node 210 fails, the witness node 210 stops receiving the heartbeat message and determines that the corresponding site may have failed. For example, if a heartbeat message is not received for a threshold time period, it may be determined that the corresponding site may have failed.
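A minimal sketch of this detection, assuming a hypothetical HeartbeatMonitor running at the witness node, is shown below; the five-second threshold is an assumption for illustration, not a value from the system described herein.

```python
# Illustrative sketch only: the witness node tracks the last heartbeat received
# from each site and flags a site as possibly failed when no heartbeat arrives
# within a threshold.
import time

class HeartbeatMonitor:
    def __init__(self, timeout_seconds: float = 5.0):  # assumed threshold
        self.timeout = timeout_seconds
        self.last_seen = {}  # site name -> timestamp of last heartbeat

    def record_heartbeat(self, site: str) -> None:
        self.last_seen[site] = time.monotonic()

    def possibly_failed_sites(self) -> list[str]:
        now = time.monotonic()
        return [site for site, last in self.last_seen.items()
                if now - last > self.timeout]
```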
VCIs may operate at the sites 206 and 208. In the illustrated example of
A change in the status of communication among the sites 206, 208, and 212 can trigger a determination at the sites 206 and 208 whether to restart VCIs. This change of status can be caused by a failure of one or more of the inter-site network links 214, 216, and 218 and/or a failure of one of the sites 206, 208, and 212.
Operations 300 continue at step 304 with automatically registering, by the witness node, the site (e.g., the first site 206 or the second site 208 of
Operations 300 continue at step 306 with monitoring, by the sites (e.g., a node in each of the sites) and the witness node, the status of the sites (e.g., failed or operational, etc.) and the status of the inter-site network links (e.g., the inter-site network links 214, 216, and 218 of
Operations 300 continue at step 308 with determining, by the sites, whether there has been a site and/or inter-site network link failure. For example, the witness node may not receive heartbeat messages from the sites (e.g., a node in each of the sites) and/or one site (e.g., a node in the site) may not receive communication from the other site (e.g., a node in the site).
When there has been a site and/or inter-network link failure, operations 300 continue at step 310 with determining, by the sites (e.g., a node in each of the sites) whether the failure(s) affect the VCI's ability to operate in its preferred fault domain. For example, the VCI's ability to operate in its preferred fault domain is not affected when the site of the preferred fault domain has not failed and when the failures do not cause the secondary fault domain to determine that the preferred fault domain has failed (e.g., via the witness node).
When the VCI's ability to operate in its preferred fault domain is affected, the operations 300 continue at step 312 with restarting the VCI in the site designated as the secondary fault domain.
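A hedged sketch of the decision at steps 308-312 follows, with hypothetical helper names: the VCI is restarted in its secondary fault domain only when the detected failures actually affect its ability to keep running in its preferred fault domain.

```python
# Illustrative decision helper; the parameter names and the restart callback
# are assumptions for illustration.
def handle_detected_failure(vci_id: str,
                            preferred_site_failed: bool,
                            secondary_believes_preferred_failed: bool,
                            restart_in_secondary) -> bool:
    """Return True if the VCI was restarted, False if it keeps running in place."""
    # Step 310: the preferred fault domain is unaffected when its site is up and
    # the secondary site (consulting the witness node) does not conclude otherwise.
    affected = preferred_site_failed or secondary_believes_preferred_failed
    if affected:
        # Step 312: restart the VCI at the site designated as the secondary fault domain.
        restart_in_secondary(vci_id)
        return True
    return False

# Example: an inter-site link failure alone, with the witness node reachable
# from both sites, leaves the VCI running where it is.
print(handle_detected_failure("vci-204a", False, False, lambda vci: None))  # False
```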
Operations 400 continue at step 404 with automatically registering, by the witness node, the current site at which the VCI is operating as the preferred fault domain of the VCI and, optionally in some cases, registering the prior preferred fault domain as the secondary fault domain.
Operations 400 continue at step 406 with handling faults that affect the VCI's ability to operate in the preferred fault domain. An example of handling these faults is described at steps 306-312 of operations 300 of
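Continuing the hypothetical WitnessRegistry sketch shown earlier, a usage example of steps 404-406 might look like the following; the VCI identifier and site names are assumptions for illustration.

```python
# Continuing the earlier WitnessRegistry sketch: after the VCI restarts at site
# 208, the registry re-registers site 208 as the preferred fault domain and the
# prior preferred site 206 as the secondary (step 404), so later faults
# (step 406) are handled relative to where the VCI actually runs.
registry = WitnessRegistry()
registry.on_vci_created("vci-204a", current_fault_domain="site-206")

# Site 206 fails; the VCI is restarted at site 208 (step 312 of operations 300).
registry.on_vci_restarted("vci-204a", new_fault_domain="site-208")
# Site 208 is now the preferred fault domain; an inter-site link failure alone
# no longer forces this VCI to be restarted back at site 206.
```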
The various embodiments described herein may employ various computer-implemented operations involving data stored in computer systems. For example, these operations may require physical manipulation of physical quantities—usually, though not necessarily, these quantities may take the form of electrical or magnetic signals, where they or representations of them are capable of being stored, transferred, combined, compared, or otherwise manipulated. Further, such manipulations are often referred to in terms, such as producing, identifying, determining, or comparing. Any operations described herein that form part of one or more embodiments of the invention may be useful machine operations. In addition, one or more embodiments of the invention also relate to a device or an apparatus for performing these operations. The apparatus may be specially constructed for specific required purposes, or it may be a general purpose computer selectively activated or configured by a computer program stored in the computer. In particular, various general purpose machines may be used with computer programs written in accordance with the teachings herein, or it may be more convenient to construct a more specialized apparatus to perform the required operations.
The various embodiments described herein may be practiced with other computer system configurations including hand-held devices, microprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and/or the like.
One or more embodiments of the present invention may be implemented as one or more computer programs or as one or more computer program modules embodied in one or more computer readable media. The term computer readable medium refers to any data storage device that can store data which can thereafter be input to a computer system. Computer readable media may be based on any existing or subsequently developed technology for embodying computer programs in a manner that enables them to be read by a computer. Examples of a computer readable medium include a hard drive, network attached storage (NAS), read-only memory, random-access memory (e.g., a flash memory device), a CD (Compact Disc), such as a CD-ROM, a CD-R, or a CD-RW, a DVD (Digital Versatile Disc), a magnetic tape, and other optical and non-optical data storage devices. The computer readable medium can also be distributed over a network-coupled computer system so that the computer readable code is stored and executed in a distributed fashion.
Although one or more embodiments of the present invention have been described in some detail for clarity of understanding, it will be apparent that certain changes and modifications may be made within the scope of the claims. Accordingly, the described embodiments are to be considered as illustrative and not restrictive, and the scope of the claims is not to be limited to details given herein, but may be modified within the scope and equivalents of the claims. In the claims, elements and/or steps do not imply any particular order of operation, unless explicitly stated in the claims.
Virtualization systems in accordance with the various embodiments may be implemented as hosted embodiments, non-hosted embodiments, or embodiments that tend to blur distinctions between the two; all such implementations are envisioned. Furthermore, various virtualization operations may be wholly or partially implemented in hardware. For example, a hardware implementation may employ a look-up table for modification of storage access requests to secure non-disk data.
Certain embodiments as described above involve a hardware abstraction layer on top of a host computer. The hardware abstraction layer allows multiple contexts to share the hardware resource. In one embodiment, these contexts are isolated from each other, each having at least a user application running therein. The hardware abstraction layer thus provides benefits of resource isolation and allocation among the contexts. In the foregoing embodiments, virtual machines are used as an example for the contexts and hypervisors as an example for the hardware abstraction layer. As described above, each virtual machine includes a guest operating system in which at least one application runs. It should be noted that these embodiments may also apply to other examples of contexts, such as containers not including a guest operating system, referred to herein as “OS-less containers” (see, e.g., www.docker.com). OS-less containers implement operating system-level virtualization, wherein an abstraction layer is provided on top of the kernel of an operating system on a host computer. The abstraction layer supports multiple OS-less containers each including an application and its dependencies. Each OS-less container runs as an isolated process in userspace on the host operating system and shares the kernel with other containers. The OS-less container relies on the kernel's functionality to make use of resource isolation (CPU, memory, block I/O, network, etc.) and separate namespaces and to completely isolate the application's view of the operating environments. By using OS-less containers, resources can be isolated, services restricted, and processes provisioned to have a private view of the operating system with their own process ID space, file system structure, and network interfaces. Multiple containers can share the same kernel, but each container can be constrained to only use a defined amount of resources such as CPU, memory and I/O. The term “virtualized computing instance” as used herein is meant to encompass both VMs and OS-less containers.
Many variations, modifications, additions, and improvements are possible, regardless of the degree of virtualization. The virtualization software can therefore include components of a host, console, or guest operating system that performs virtualization functions. Plural instances may be provided for components, operations or structures described herein as a single instance. Boundaries between various components, operations and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of the invention(s). In general, structures and functionality presented as separate components in exemplary configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements may fall within the scope of the appended claim(s).