The present disclosure relates to systems, methods, and devices that manage resource utilization within a cloud computing environment.
Computer systems and related technology affect many aspects of society. Indeed, the computer system's ability to process information has transformed the way we live and work. More recently, computer systems have been coupled to one another and to other electronic devices to form both wired and wireless computer networks over which the computer systems and other electronic devices can transfer electronic data. Accordingly, the performance of many computing tasks is distributed across a number of different computer systems and/or a number of different computing environments, such as cloud computing environments.

Cloud computing environments may comprise one or more systems that each includes one or more computing nodes (e.g., computer systems operating as servers/hosts). Each node is capable of hosting one or more virtual machines (VMs). In general, a VM provides an isolated environment within which one or more applications execute, potentially along with a supporting operating system. Each node includes an abstraction layer, such as a hypervisor, which provides one or more VMs access to physical resources of the node in an abstracted manner (e.g., by emulating virtual resources for hosted VMs), and which provides isolation between the VMs. For example, from the perspective of any given VM, a hypervisor provides the illusion that the VM is interfacing with a physical resource, even though the VM only interfaces with the appearance (e.g., a virtual resource) of a physical resource. Examples of physical resources include processing capacity (e.g., central processing unit cores), memory, disk space, network bandwidth, media drives, and so forth.
Cloud computing environments are subject to fragmentation of the available resources within the nodes making up the cloud computing environment. One form of fragmentation is the stranding of node resources due to resource bottlenecks. Depending on which one or more virtual machines (VMs) are allocated to a node, and depending on the resource requirements of those VM(s), it is possible that one or more of the node's resources cannot be fully used and are thus “stranded.” In particular, when one “bottleneck” resource is fully or nearly fully consumed by the allocated VM(s), allocation of an additional VM to the node may be blocked, causing a “stranded” resource at the node to remain unused and hence wasted.
At least some embodiments described herein use measurements of node resources within a cloud computing environment to identify VM hosting nodes that have stranded resources. In embodiments, measurements include an amount of available memory at a node, a number of unallocated CPU cores or unused CPU cycles at a node, an amount of available disk space at a node, an amount of available network bandwidth at a node, and the like. These embodiments then initiate migration of one or more VMs away from the identified hosting nodes, thereby making the stranded resources available for VM allocations or other purposes. By doing so, the embodiments described herein automatically and proactively reduce the stranded resources within cloud computing environments. In one example, a stranded resource is a resource of a first type (e.g., memory) on a hosting node which is prevented from being used for VM deployment due to insufficient available resource of a second type (e.g., processing capacity) at the hosting node. Since deploying a VM involves using both memory and processing capacity, a lack of either resource type means that no further VM can be deployed, and the remaining available resource of the other type is “stranded.” The type of resource that is insufficiently available for VM deployment is referred to as a bottleneck resource.
Technical effects of the embodiments herein include an improved efficiency of resource utilization within cloud computing environments, an increased capacity of cloud computing environments (e.g., with fewer stranded resources, a given number of nodes are able to serve a greater VM demand than would otherwise be possible), a lower power consumption by cloud computing environments, a lower carbon footprint for cloud computing environments, and improved reliability of cloud computing environments (e.g., through a reduction in capacity-related VM allocation failures).
In some aspects, the techniques described herein relate to a method, implemented by a computer system that includes a processor, for recovery of stranded resources within a cloud computing environment, the method including: measuring a corresponding resource utilization signal at each of a plurality of nodes that each hosts at least one corresponding VM; based on each corresponding resource utilization signal, identifying a set of candidate nodes, each candidate node including a corresponding stranded resource that is unutilized due to utilization of a corresponding bottleneck resource at the candidate node, including calculating an amount of the corresponding stranded resource at each candidate node; from a plurality of VMs hosted at the set of candidate nodes, identifying a set of candidate VMs for migration for stranded resource recovery, including calculating a score for each candidate VM based on a degree of imbalance between the corresponding stranded resource and the corresponding bottleneck resource at a candidate node hosting the candidate VM; and initiating migration of at least one candidate VM in the set of candidate VMs.
In some aspects, the techniques described herein relate to a computer system for recovery of stranded resources within a cloud computing environment, the computer system including: a processor; and a computer storage media that stores computer-executable instructions that are executable by the processor to cause the computer system to at least: measure a corresponding resource utilization signal at each of a plurality of nodes that each hosts at least one corresponding VM; based on each corresponding resource utilization signal, identify a set of candidate nodes, each candidate node including a corresponding stranded resource that is unutilized due to utilization of a corresponding bottleneck resource at the candidate node, including calculating an amount of the corresponding stranded resource at each candidate node; from a plurality of VMs hosted at the set of candidate nodes, identify a set of candidate VMs for migration for stranded resource recovery, including calculating a score for each candidate VM based on a degree of imbalance between the corresponding stranded resource and the corresponding bottleneck resource at a candidate node hosting the candidate VM; and initiate migration of at least one candidate VM in the set of candidate VMs.
In some aspects, the techniques described herein relate to a computer storage media for recovery of stranded resources within a cloud computing environment, the computer storage media storing computer-executable instructions that are executable by a processor to cause a computer system to at least: measure a corresponding resource utilization signal at each of a plurality of nodes that each hosts at least one corresponding VM; based on each corresponding resource utilization signal, identify a set of candidate nodes, each candidate node including a corresponding stranded resource that is unutilized due to utilization of a corresponding bottleneck resource at the candidate node, including calculating an amount of the corresponding stranded resource at each candidate node; from a plurality of VMs hosted at the set of candidate nodes, identify a set of candidate VMs for migration for stranded resource recovery, including calculating a score for each candidate VM based on a degree of imbalance between the corresponding stranded resource and the corresponding bottleneck resource at a candidate node hosting the candidate VM; and initiate migration of at least one candidate VM in the set of candidate VMs.
In some aspects of the techniques described herein, each corresponding resource utilization signal indicates utilization of at least one of: a memory resource, a central processing unit resource, a disk resource, a network resource, or a platform resource.
In some aspects of the techniques described herein, identifying the set of candidate nodes also includes: identifying, from the plurality of nodes, at least one first node that is included on an allow list or a block list; and excluding the at least one first node from the set of candidate nodes.
In some aspects of the techniques described herein, identifying the set of candidate nodes also includes: determining, from the plurality of nodes, at least one second node whose corresponding stranded resource is temporarily stranded; and excluding the at least one second node from the set of candidate nodes.
In some aspects of the techniques described herein, identifying the set of candidate nodes also includes sorting the set of candidate nodes based on the amount of the corresponding stranded resource at each candidate node.
In some aspects of the techniques described herein, identifying the set of candidate VMs also includes: identifying, from the plurality of VMs hosted at the set of candidate nodes, at least one first VM as being a priority VM; and excluding the at least one first VM from the set of candidate VMs.
In some aspects of the techniques described herein, identifying the set of candidate VMs also includes: identifying, from the plurality of VMs hosted at the set of candidate nodes, at least one second VM as being a short-lived VM; and excluding the at least one second VM from the set of candidate VMs.
In some aspects of the techniques described herein, identifying the set of candidate VMs also includes sorting the set of candidate VMs based on the score for each candidate VM.
In some aspects of the techniques described herein, identifying the set of candidate VMs also includes limiting a number of candidate VMs in the set of candidate VMs based on having sorted the set of candidate VMs.
In some aspects of the techniques described herein, initiating migration of at least one candidate VM in the set of candidate VMs includes selecting the at least one candidate VM for migration based on having sorted the set of candidate VMs.
In some aspects of the techniques described herein, initiating migration of at least one candidate VM in the set of candidate VMs includes migrating the at least one candidate VM from a first node of the plurality of nodes to a second node of the plurality of nodes.
In some aspects of the techniques described herein, initiating migration of at least one candidate VM in the set of candidate VMs includes delaying the migration until a node maintenance event.
Some aspects of the techniques described herein further include tracking a stranded resource metric across the plurality of nodes.
In some aspects of the techniques described herein, identifying the set of candidate nodes also includes at least one of: considering a first resource configuration of a first individual VM that is queued for allocation within the plurality of nodes; considering a second resource configuration of a second individual VM that is predicted for allocation within the plurality of nodes; considering a third resource configuration of a first set of a plurality of VMs that is queued for allocation within the plurality of nodes; or considering a fourth resource configuration of a second set of a plurality of VMs that is predicted for allocation within the plurality of nodes.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
In order to describe the manner in which the above-recited and other advantages and features of the invention can be obtained, a more particular description of the invention briefly described above will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. Understanding that these drawings depict only typical embodiments of the invention and are not therefore to be considered to be limiting of its scope, the invention will be described and explained with additional specificity and detail through the use of the accompanying drawings.
Fragmentation is a significant factor affecting the efficiency of cloud computing environments. Fragmentation in cloud computing environments can arise for a myriad of reasons (e.g., due to how a virtual machine (VM) allocator allocates the various VMs onto existing nodes, due to customers organically deleting VMs and leaving fragmented cores on nodes, or due to platform constraints such as dedicated hosts). One way to mitigate fragmentation in cloud computing environments is to migrate VMs between nodes, in order to pack those VMs into fewer nodes and create empty nodes onto which new VMs can be allocated.
The inventors have recognized that one mode in which fragmentation manifests in cloud computing environments is via stranding of node resources. Each node has a specific configuration of available resources, such as central processing unit (CPU) cores, memory resources, disk resources, and the like. Additionally, each VM may have corresponding requirements for those resources. For example, a VM may need to be allocated a certain number of CPU cores, a certain amount of memory, a certain amount of disk space or input/output (IO) bandwidth, etc. Thus, depending on which VM (or VMs) are allocated to a particular node, it is possible that one or more of the node's resources cannot be fully utilized because one or more bottleneck resources block further allocations of VMs to the node, which strands one or more unused or unallocated resources at the node.
In one example, one or more VMs with relatively high required memory-to-cores ratios are allocated to a node, which fully utilizes the memory on the node while leaving unused CPU cores on the node.
In another example, one or more VMs allocated to a node fully utilize the CPU cores of the node, and unused memory is rendered stranded. The resources on a node may also be stranded due to certain platform constraints. For example, there may be a platform limit on the number of VMs that can be hosted on a given node, and it is possible that a node may be hosting the maximum number of allowed VMs, stranding unused CPU cores, memory, disk space, etc.
At least some embodiments described herein use measurements of node resources within a cloud computing environment to identify VM hosting nodes that have stranded resources. These embodiments then initiate migration of VMs causing resource bottlenecks away from those hosting nodes, thereby making previously stranded resources available for VM allocations or other purposes. By doing so, the embodiments described herein automatically and proactively reduce the stranded resources within cloud computing environments.
As such, the embodiments herein improve the efficiency of resource utilization within cloud computing environments. Since stranded resources effectively lower the available capacity of a given set of nodes within a cloud computing environment, the improved resource utilization efficiency enabled by the embodiments herein leads to an overall capacity increase of the cloud computing environment, enabling fewer nodes to serve a given VM demand. This leads to lower power consumption by the cloud computing environment, and a lower carbon footprint for the cloud computing environment. Additionally, in some cases, stranding of resources can cause capacity-related VM allocation failures. Thus, the improved resource utilization efficiency enabled by the embodiments herein can improve reliability of the cloud computing environment.
As used herein, a “virtual machine” (VM) is any isolated execution environment, such as an emulated execution environment created by an emulator (e.g., BOCHS, QEMU), a virtualized execution environment created by a hypervisor (e.g., HYPER-V, XEN, KVM), a user-space instance (e.g., container, zone, partition, jail) created via OS-level virtualization, and the like.
In embodiments, the rescue system 201 uses measurements of resource utilization at the nodes 216 to proactively identify and recover stranded resources among the resources 217 of the nodes 216. As non-limiting examples only, in one embodiment the rescue system 201 rescues stranded CPU cores on one or more of nodes 216 where the hosted VMs 218 have utilized all or almost all of a node's memory; in another embodiment the rescue system 201 rescues stranded memory on a node where the hosted VMs 218 have utilized all or almost all of the node's CPU cores; in another embodiment the rescue system 201 rescues stranded CPU cores on a node where the hosted VMs 218 have hit or approached a limit on the maximum number of VMs that the cloud computing environment can support on a given node; and in another embodiment the rescue system 201 rescues stranded CPU cores on a node where the hosted VMs 218 have utilized all or almost all of the node's available disk space.
In embodiments, the rescue system 201 operates in a fully- or semi-automatic manner, such as through a cloud or datacenter management fabric. For instance, in one embodiment the rescue system 201 operates continuously in the background and has configurable settings to run live migrations, such as a live migration on N nodes every M minutes.
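For illustration only, the following Python sketch shows one shape such a background loop could take; the setting names and values, and the select_nodes and migrate_from callables, are hypothetical assumptions rather than anything prescribed by this disclosure:

```python
import time

NODES_PER_PASS = 10          # illustrative "N nodes" setting
PASS_INTERVAL_MINUTES = 15   # illustrative "every M minutes" setting

def run_rescue_loop(select_nodes, migrate_from):
    """Continuously select up to N stranding nodes and initiate live migrations."""
    while True:
        for node in select_nodes(limit=NODES_PER_PASS):
            migrate_from(node)  # initiate live migration(s) away from this node
        time.sleep(PASS_INTERVAL_MINUTES * 60)
```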
As mentioned, the rescue system 201 uses measurements of resource utilization at the nodes 216. Thus, as shown, the rescue system 201 includes a measuring component 202 that collects signals regarding utilization of resources 217 at the nodes 216. In some embodiments, the measuring component 202 actively (e.g., via probing of the nodes 216) or passively (e.g., via listening to signals emitted by the nodes 216) collects these signals by directly communicating with each of nodes 216. In additional or alternative embodiments, the measuring component 202 collects these signals indirectly via a monitoring component 214 which, in embodiments, monitors overall stranded resources across one or more cloud computing environments. In embodiments, measurements include information sufficient to determine an amount of available memory at a node, a number of unallocated CPU cores or unused CPU cycles at a node, an amount of available disk space at a node, an amount of available network bandwidth at a node, and the like.
Based on the signals measured by the measuring component 202, the rescue system 201 uses a candidate node selection component 203 to intelligently identify one or more nodes of the nodes 216 that currently have one or more stranded resources that are unutilized due to utilization of a corresponding bottleneck resource at the node. In embodiments, a stranded resource is stranded by utilization of a corresponding bottleneck resource due to the bottleneck resource being fully utilized (i.e., exhausted) at the node, such as in the examples described above.
In some embodiments, the candidate node selection component 203 includes a node filtering component 204. In embodiments, the node filtering component 204 excludes one or more of nodes 216 from further consideration by the candidate node selection component 203, for instance due to platform constraints. In one example, the node filtering component 204 excludes any of nodes 216 that are not identified on an allow list, and in another example the node filtering component 204 excludes any of nodes 216 that are identified on a block list. In some embodiments, the node filtering component 204 operates at a cluster level, to exclude clusters of a plurality of nodes from further consideration by the candidate node selection component 203. Thus, in embodiments, an allow list or a block list refers to nodes directly, and/or to clusters of nodes.
Referring to process flow 300, after operation of the node filtering component 204 at step 302 (filter nodes based on an allow list or a block list), a set 301a of all nodes (e.g., nodes 216) becomes a set 301b of filtered nodes. It will be appreciated that, in some embodiments, the node filtering component 204 is excluded from the candidate node selection component 203, and/or step 302 is omitted from process flow 300.
The candidate node selection component 203 also includes a stranded node identification component 205. In embodiments, the stranded node identification component 205 applies one or more rules or heuristics to signals gathered by the measuring component 202 to determine if one or more of nodes 216 have one or more stranded resources. In one example, the stranded node identification component 205 identifies resource A (a stranded resource) as being stranded by resource B (a bottleneck resource) on a particular node when the following evaluates to true:
Utilization_of_B ≥ Threshold_B AND Utilization_of_A ≤ Threshold_A

where Utilization_of_A and Utilization_of_B are the fractions of resource A and resource B, respectively, that are consumed at the node, and where Threshold_A and Threshold_B are configurable thresholds (e.g., with Threshold_B chosen close to one, so that resource B is fully or nearly fully utilized, and Threshold_A chosen so that a meaningful amount of resource A remains unused).
In embodiments, if the stranded node identification component 205 identifies resource A as a stranded resource on a node, then the stranded node identification component 205 calculates an amount of the resource stranded on the node as follows:
1 − Utilization_of_A

where Utilization_of_A is the fraction of resource A that is consumed at the node.
In a particular example, for inferring whether a node has stranded CPU cores due to memory being full, signals indicating CPU core utilization and memory utilization on the node can be used to consider the node as stranding cores due to full memory if the following evaluates to true:
Utilization_of_Memory ≥ Threshold_Memory AND Utilization_of_Cores ≤ Threshold_Cores

where Utilization_of_Memory and Utilization_of_Cores are the fractions of the node's memory and CPU cores, respectively, that are consumed by the hosted VMs, and where Threshold_Memory and Threshold_Cores are configurable thresholds (e.g., with Threshold_Memory chosen close to one).
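As a concrete illustration of this heuristic and of the 1 − Utilization_of_A calculation above, consider the following Python sketch; the threshold values and field names are hypothetical assumptions, not values prescribed by the disclosure:

```python
# Hypothetical sketch of the memory-full/cores-stranded heuristic.
# Utilization values are fractions in [0, 1]; thresholds are illustrative.

MEMORY_FULL_THRESHOLD = 0.95  # bottleneck resource considered (nearly) full
CORES_IDLE_THRESHOLD = 0.50   # stranded resource considered underused

def strands_cores_due_to_memory(node):
    """True if the node's CPU cores appear stranded by full memory."""
    return (node["memory_utilization"] >= MEMORY_FULL_THRESHOLD
            and node["cores_utilization"] <= CORES_IDLE_THRESHOLD)

def stranded_amount(utilization_of_a):
    """Amount of stranded resource A, per the 1 - Utilization_of_A formula."""
    return 1.0 - utilization_of_a

node = {"memory_utilization": 0.97, "cores_utilization": 0.30}
if strands_cores_due_to_memory(node):
    print(f"{stranded_amount(node['cores_utilization']):.0%} of CPU capacity stranded")
```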
In embodiments, when determining if one or more of nodes 216 have one or more stranded resources, the stranded node identification component 205 considers attributes of one or more VMs that are, or that could be, allocated to one or more of nodes 216. In one example, the stranded node identification component 205 considers an available VM template definition that defines, e.g., an amount of memory, a number of CPU cores, etc. required for a VM instantiated based on the VM template definition. In another example, the stranded node identification component 205 considers a set of potential VMs that are valid to place on one or more of nodes 216 (e.g., due to policy, contractual, security, or other reasons). In another example, the stranded node identification component 205 considers a current set of VMs (including the resources required by those VMs) that are queued for placement. In another example, the stranded node identification component 205 considers a predicted set of VMs that will need to be generated in an upcoming time period (e.g., hour, N hours, etc.). In embodiments, the stranded node identification component 205 considers combinations of one or more of the foregoing.
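One way this consideration could be operationalized, sketched below under assumed template shapes and resource names, is to test whether any valid VM template still fits within a node's unallocated resources; if free capacity remains but no template fits, that capacity is effectively stranded:

```python
# Hypothetical sketch: free capacity that no valid VM template can use is stranded.
TEMPLATES = [(2, 8), (4, 16), (8, 32)]  # assumed (cpu_cores, memory_gb) shapes

def template_fits(free_cores, free_memory_gb, template):
    cores, memory_gb = template
    return cores <= free_cores and memory_gb <= free_memory_gb

def has_stranded_capacity(free_cores, free_memory_gb):
    """True if the node has free resources but cannot host any VM template."""
    has_free = free_cores > 0 or free_memory_gb > 0
    fits_any = any(template_fits(free_cores, free_memory_gb, t) for t in TEMPLATES)
    return has_free and not fits_any

print(has_stranded_capacity(free_cores=16, free_memory_gb=1))  # True: memory bottleneck
```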
In embodiments, the stranded node identification component 205 considers a plurality of VMs as a set, in addition to (or rather than) individually. In one example, the stranded node identification component 205 operates to identify a resource as being stranded when it prevents allocation of the set of VMs as a whole, even if that resource may be sufficient to enable allocation of one of the VMs of the set individually. In another example, the stranded node identification component 205 operates to identify a resource as being stranded when there is a customer requirement to only run the plurality of VMs on the same physical machine (e.g., due to high security or classification).
Referring to process flow 300, after operation of the candidate node selection component 203 at step 303 (identify nodes with stranded resources), the set 301b of filtered nodes becomes a set 301c of stranding nodes.
In some embodiments, the candidate node selection component 203 also includes a temporary stranding exclusion component 206. In embodiments, the temporary stranding exclusion component 206 excludes one or more nodes that are only temporarily stranding resources. For example, the temporary stranding exclusion component 206 may only consider a node to be a stranding node when the majority of VMs hosted on the node are considered to be “long running” VMs (e.g., having already run longer than a predetermined number of minutes, hours, days, etc.). Thus, the temporary stranding exclusion component 206 excludes a node if a majority of VMs at the node are considered to be short-lived (i.e., not long running). In some embodiments, the temporary stranding exclusion component 206 applies the following check to determine whether the “majority” condition is satisfied for a node:
Number_of_Long_Running_VMs / Total_Number_of_VMs > 0.5

where Number_of_Long_Running_VMs is the number of VMs at the node that have been running longer than the predetermined threshold, and Total_Number_of_VMs is the total number of VMs hosted at the node.
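A minimal Python sketch of this check, assuming a per-VM uptime signal and an illustrative long-running threshold of one day:

```python
from datetime import timedelta

LONG_RUNNING_THRESHOLD = timedelta(days=1)  # illustrative, not prescribed

def majority_long_running(vm_uptimes):
    """True if more than half of the node's VMs have run past the threshold."""
    long_running = sum(1 for uptime in vm_uptimes if uptime >= LONG_RUNNING_THRESHOLD)
    return long_running > len(vm_uptimes) / 2

uptimes = [timedelta(days=3), timedelta(hours=2), timedelta(days=10)]
print(majority_long_running(uptimes))  # True: 2 of 3 VMs are long running
```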
Referring to process flow 300, after operation of the temporary stranding exclusion component 206 at step 304 (exclude nodes that are temporarily stranding resources), the set 301c of stranding nodes becomes a set 301d of candidate nodes. It will be appreciated that, in some embodiments, the temporary stranding exclusion component 206 is excluded from the candidate node selection component 203, and/or step 304 is omitted from process flow 300.
In some embodiments, the candidate node selection component 203 also includes a node sorting component 207. In embodiments, the node sorting component 207 sorts the set of candidate nodes by an amount of resource stranded at each node. In embodiments, the node sorting component 207 sorts the set of candidate nodes in descending order (i.e., to create an ordered set of candidate nodes), such that the node sorting component 207 sorts a first node with a higher amount of stranded resource prior to (or higher than) a second node with a lower amount of stranded resource. As a result of this sorting, the first node is given higher priority for VM migration than the second node.
Referring to process flow 300, after operation of the node sorting component 207 at step 305 (sort candidate nodes), the set 301d of candidate nodes becomes a set 301e of sorted candidate nodes. It will be appreciated that, in some embodiments, the node sorting component 207 is excluded from the candidate node selection component 203 and/or step 305 is omitted from process flow 300.
Based on the set of candidate nodes (e.g., set 301d) identified by the candidate node selection component 203, the rescue system 201 uses a candidate VM selection component 208 to intelligently identify one or more “bottleneck” VMs on those candidate nodes that can be migrated from those candidate nodes in order to rescue stranded resources. These VMs become a set of candidate VMs, at least one of which is subsequently migrated from its respective candidate node in order to rescue stranded resources at that node. Example operation of the candidate VM selection component 208 is now described in reference to process flow 400.
In some embodiments, the candidate VM selection component 208 includes a VM filtering component 209. In embodiments, the VM filtering component 209 excludes one or more VMs hosted by the candidate node(s) from being selected as candidates for migration based on one or more filtering criteria. The criteria used by the VM filtering component 209 can vary, but in embodiments, the VM filtering component 209 excludes VMs that are considered to be sensitive, such as based on a type of workload executed on those VMs (e.g., which may be adversely affected by a VM migration), based on a service-level agreement with a customer with whom the VMs are associated, etc.
Referring to process flow 400, after operation of the VM filtering component 209 at step 402 (filter VMs based on sensitivity), a set 401a of VMs on candidate node(s) becomes a set 401b of filtered VMs. It will be appreciated that, in some embodiments, the VM filtering component 209 is excluded from the candidate VM selection component 208 and/or step 402 is omitted from process flow 400.
In some embodiments, the candidate VM selection component 208 also includes a short-lived VM exclusion component 210. In embodiments, the short-lived VM exclusion component 210 excludes one or more VMs that are short-lived from being candidates for migration. For example, the short-lived VM exclusion component 210 may consider a VM to be short-lived if it has been running for less than a predetermined number of minutes, hours, days, etc. In embodiments, exclusion of short-lived VMs from migration reduces the downtime for short-lived VMs that might otherwise be organically deleted by the VM's owner.
Referring to process flow 400, after operation of the short-lived VM exclusion component 210 at step 403 (exclude short-lived VMs), the set 401b of filtered VMs becomes a set 401c of a subset of VMs on candidate node(s). It will be appreciated that, in some embodiments, the short-lived VM exclusion component 210 is excluded from the candidate VM selection component 208 and/or step 403 is omitted from process flow 400.
In some embodiments, the candidate VM selection component 208 also includes a VM sorting component 211. In embodiments, the VM sorting component 211 scores one or more sets of candidate VMs based on a scoring function. In an example, the VM sorting component 211 uses a scoring function that measures a degree of imbalance between a stranded resource at the node and a bottleneck resource used by at least one VM at the node. This measurement quantifies the contribution of individual VMs to causing the stranding of the stranded resource on the node. For example, in a scenario where one or more of a node's CPU cores are stranded due to the node's memory being full, migrating the VM whose memory utilization exceeds its CPU core utilization by the greatest margin has the strongest effect in rescuing the stranded CPU core(s). In embodiments, this scoring function is defined as:
Utilization_of_Bottleneck_Resource − Utilization_of_Stranded_Resource

where Utilization_of_Bottleneck_Resource is the fraction of the node's bottleneck resource that is consumed by the VM being scored, and Utilization_of_Stranded_Resource is the fraction of the node's stranded resource that is consumed by that VM.
In embodiments, the VM sorting component 211 also sorts the VMs in descending order based on each VM's score. In some embodiments, the VM sorting component 211 sorts VMs across all candidate nodes. In other embodiments, the VM sorting component 211 sorts VMs at each candidate node. Referring to process flow 400, after operation of the VM sorting component 211 at step 404 (score and sort VMs), the set 401c of a subset of VMs on candidate node(s) becomes a set 401d of sorted VMs.
In embodiments, the VM sorting component 211 also selects the N highest scored VMs (with positive values) as a set of candidate VMs for migration. Referring to process flow 400, after operation of the VM sorting component 211 at step 405 (select top N VMs), the set 401d of sorted VMs becomes a set 401e of candidate VMs (e.g., at a particular node, or across all candidate nodes). It will be appreciated that, in some embodiments, the VM sorting component 211 is excluded from the candidate VM selection component 208 and/or one or more of step 404 or step 405 is omitted from process flow 400.
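Taken together, the scoring, sorting, and top-N selection steps could look like the following sketch; the per-VM utilization fields and the value of N are assumptions made for illustration:

```python
def imbalance_score(vm):
    """Per the scoring function above: bottleneck use minus stranded use."""
    return vm["bottleneck_utilization"] - vm["stranded_utilization"]

def select_candidate_vms(vms, top_n=3):
    """Sort descending by score and keep the N highest positively scored VMs."""
    ranked = sorted(vms, key=imbalance_score, reverse=True)
    return [vm for vm in ranked if imbalance_score(vm) > 0][:top_n]

vms = [
    {"name": "vm-a", "bottleneck_utilization": 0.40, "stranded_utilization": 0.05},
    {"name": "vm-b", "bottleneck_utilization": 0.30, "stranded_utilization": 0.25},
    {"name": "vm-c", "bottleneck_utilization": 0.10, "stranded_utilization": 0.20},
]
print([vm["name"] for vm in select_candidate_vms(vms)])  # ['vm-a', 'vm-b']
```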
Based on the set of VMs (e.g., set 401e) identified by the candidate VM selection component 208, the rescue system 201 uses a VM migration orchestration component 212 to determine at least one of the candidate VMs that is to be migrated away from its hosting node. In embodiments, the VM migration orchestration component 212 initiates that migration using a VM placement component 213. In embodiments, the VM migration orchestration component 212 prioritizes migration of VMs based on the sorted order created by the VM sorting component 211.
In some embodiments, the VM migration orchestration component 212 delays migrations until node maintenance events (e.g., operating system updates, node reboots, etc.) in order to opportunistically piggyback on those maintenance events to achieve rescue of stranded resources, while not causing additional VM downtime beyond what would otherwise be incurred by the maintenance.
In some embodiments, the VM migration orchestration component 212 includes a configurable safeguard that caps the number of VM migrations executing in parallel per cluster at any given time.
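Such a safeguard could be realized, for example, with a per-cluster bounded semaphore, as in this hypothetical sketch; the cap value and the do_migration callable are assumptions:

```python
import threading
from collections import defaultdict

MAX_PARALLEL_MIGRATIONS = 2  # illustrative per-cluster cap

_migration_slots = defaultdict(
    lambda: threading.BoundedSemaphore(MAX_PARALLEL_MIGRATIONS))

def migrate_with_safeguard(cluster_id, do_migration):
    """Run do_migration() only while the cluster has a free migration slot."""
    with _migration_slots[cluster_id]:  # blocks until a slot is available
        do_migration()
```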
As mentioned, in some embodiments the measuring component 202 collects signals regarding resource utilization at the nodes 216 indirectly via a monitoring component 214, which monitors overall stranded resources across one or more cloud computing environments. As shown, in embodiments, the monitoring component 214 also includes an efficiency tracking component 215. In embodiments, the efficiency tracking component 215 tracks resource utilization efficiency among at least the nodes 216 of computer architecture 200, such as by tracking a percentage of stranded resources across the nodes 216. In embodiments, the monitoring component 214 communicates resource utilization efficiency information to the rescue system 201, and the rescue system 201 uses this information to ensure that the nodes 216 are operating within target stranded resource limits.
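As a simple illustration of such tracking, a fleet-level stranded resource metric could be the percentage of all CPU cores that sit stranded across the tracked nodes; the field names here are assumptions:

```python
def stranded_core_percentage(nodes):
    """Stranded cores as a percentage of all cores across the tracked nodes."""
    total_cores = sum(node["total_cores"] for node in nodes)
    stranded_cores = sum(node["stranded_cores"] for node in nodes)
    return 100.0 * stranded_cores / total_cores if total_cores else 0.0

nodes = [{"total_cores": 64, "stranded_cores": 16},
         {"total_cores": 64, "stranded_cores": 0}]
print(f"{stranded_core_percentage(nodes):.1f}% of cores stranded")  # 12.5%
```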
A further description of the computer architecture 200 is now provided in connection with a flow chart of an example method 500 for recovery of stranded resources within a cloud computing environment.
The following discussion now refers to a number of methods and method acts. Although the method acts may be discussed in certain orders, or may be illustrated in a flow chart as occurring in a particular order, no particular ordering is required unless specifically stated, or required because an act is dependent on another act being completed prior to the act being performed.
Referring to the flow chart of method 500, method 500 includes an act 501 of measuring resource utilization at a plurality of nodes. In embodiments, act 501 comprises measuring a corresponding resource utilization signal at each of a plurality of nodes that each hosts at least one corresponding VM. For example, the measuring component 202 collects signals regarding utilization of resources 217 at the nodes 216, either directly or indirectly via the monitoring component 214. In some embodiments, each corresponding resource utilization signal indicates utilization of at least one of a memory resource, a central processing unit resource, a disk resource, a network resource, or a platform resource.

Method 500 also includes an act 502 of identifying a set of candidate nodes. In embodiments, act 502 comprises, based on each corresponding resource utilization signal, identifying a set of candidate nodes, each candidate node including a corresponding stranded resource that is unutilized due to utilization of a corresponding bottleneck resource at the candidate node, including calculating an amount of the corresponding stranded resource at each candidate node. For example, the candidate node selection component 203 identifies one or more of nodes 216 that currently have one or more stranded resources, based on the rules or heuristics applied by the stranded node identification component 205. In some embodiments of act 502, identifying the set of candidate nodes also includes: identifying at least one first node that is included on an allow list or a block list, and excluding the at least one first node from the set of candidate nodes (e.g., by the node filtering component 204); determining at least one second node whose corresponding stranded resource is temporarily stranded, and excluding the at least one second node from the set of candidate nodes (e.g., by the temporary stranding exclusion component 206); and/or sorting the set of candidate nodes based on the amount of the corresponding stranded resource at each candidate node (e.g., by the node sorting component 207).
As mentioned, in embodiments, the stranded node identification component 205 considers attributes of one or more VMs that are, or that could be, allocated to one or more of nodes 216. As mentioned, this can include considering a plurality of VMs as a set, in addition to (or rather than) individually. Thus, in some embodiments of act 502, identifying the set of candidate nodes also includes at least one of: considering a first resource configuration of a first individual VM that is queued for allocation within the plurality of nodes; considering a second resource configuration of a second individual VM that is predicted for allocation within the plurality of nodes; considering a third resource configuration of a first set of a plurality of VMs that is queued for allocation within the plurality of nodes; or considering a fourth resource configuration of a second set of a plurality of VMs that is predicted for allocation within the plurality of nodes.
It is noted that, in act 502, there is no particular ordering shown among act 502a, act 502b, act 502c, and act 502d. Thus, the description herein does not impose any particular ordering among these acts. Additionally, as indicated by broken lines, some of these acts may be optional.
Referring again to the flow chart of method 500, method 500 also includes an act 503 of identifying a set of candidate VMs for migration. In embodiments, act 503 comprises, from a plurality of VMs hosted at the set of candidate nodes, identifying a set of candidate VMs for migration for stranded resource recovery, including calculating a score for each candidate VM based on a degree of imbalance between the corresponding stranded resource and the corresponding bottleneck resource at a candidate node hosting the candidate VM. For example, the candidate VM selection component 208 identifies one or more “bottleneck” VMs on the candidate nodes that can be migrated in order to rescue stranded resources.
In some embodiments of act 503, the candidate VM selection component 208 selects a different set of candidate VMs for each candidate node in the set of candidate nodes. In other embodiments of act 503, the candidate VM selection component 208 selects a single set of candidate VMs across the set of candidate nodes.
In some embodiments of act 503, identifying the set of candidate VMs also includes identifying, from the plurality of VMs hosted at the set of candidate nodes, at least one first VM as being a priority VM, and excluding the at least one first VM from the set of candidate VMs (e.g., by the VM filtering component 209). In some embodiments of act 503, identifying the set of candidate VMs also includes identifying at least one second VM as being a short-lived VM, and excluding the at least one second VM from the set of candidate VMs (e.g., by the short-lived VM exclusion component 210). In some embodiments of act 503, identifying the set of candidate VMs also includes calculating the score for each candidate VM based on the degree of imbalance between the corresponding stranded resource and the corresponding bottleneck resource (e.g., by the VM sorting component 211). In some embodiments of act 503, identifying the set of candidate VMs also includes sorting the set of candidate VMs based on the score for each candidate VM (act 503d). In some embodiments of act 503, identifying the set of candidate VMs also includes limiting a number of candidate VMs in the set of candidate VMs based on having sorted the set of candidate VMs.
It is noted that, in act 503, there is no particular ordering shown among act 503a, act 503b, act 503c, act 503d, and act 503e. Thus, the description herein does not impose any particular ordering among these acts. Additionally, as indicated by broken lines, some of these acts may be optional.
Referring again to the flow chart of method 500, method 500 also includes an act 504 of initiating migration of at least one candidate VM in the set of candidate VMs. In some embodiments, act 504 comprises migrating the at least one candidate VM from a first node of the plurality of nodes to a second node of the plurality of nodes (e.g., using the VM migration orchestration component 212 and the VM placement component 213).
As mentioned, some embodiments of method 500 include sorting candidate VMs based on imbalance score (act 503d). Thus, in some embodiments of act 504, initiating migration of at least one candidate VM in the set of candidate VMs includes selecting the at least one candidate VM for migration based on having sorted the set of candidate VMs.
As mentioned, in some embodiments, the VM migration orchestration component 212 delays migrations until node maintenance events (e.g., operating system updates, node reboots, etc.) in order to opportunistically piggyback on those maintenance events. Thus, in some embodiments of act 504, initiating migration of at least one candidate VM in the set of candidate VMs includes delaying the migration until a node maintenance event.
As mentioned, in some embodiments an efficiency tracking component 215 tracks resource utilization efficiency among nodes 216, and the rescue system 201 uses this information to ensure that the nodes 216 are operating within target stranded resource limits. Thus, some embodiments of method 500 also include tracking a stranded resource metric across the plurality of nodes.
Accordingly, the embodiments described herein use measurements of node resources within a cloud computing environment to identify VM hosting nodes that have stranded resources, and then initiate migration of VMs causing resource bottlenecks away from those hosting nodes, thereby making the stranded resources available for VM allocations or other purposes. By doing so, the embodiments described herein automatically and proactively reduce the stranded resources within cloud computing environments.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the features or acts described above, or to the order of the acts described above. Rather, the described features and acts are disclosed as example forms of implementing the claims.
Embodiments of the present invention may comprise or utilize a special-purpose or general-purpose computer system that includes computer hardware, such as, for example, one or more processors and system memory, as discussed in greater detail below. Embodiments within the scope of the present invention also include physical and other computer-readable media for carrying or storing computer-executable instructions and/or data structures. Such computer-readable media can be any available media that can be accessed by a general-purpose or special-purpose computer system. Computer-readable media that store computer-executable instructions and/or data structures are computer storage media. Computer-readable media that carry computer-executable instructions and/or data structures are transmission media. Thus, by way of example, and not limitation, embodiments of the invention can comprise at least two distinctly different kinds of computer-readable media: computer storage media and transmission media.
Computer storage media are physical (i.e., hardware) storage media that store computer-executable instructions and/or data structures. Physical storage media include computer hardware, such as RAM, ROM, EEPROM, solid state drives (“SSDs”), flash memory, phase-change memory (“PCM”), optical disk storage, magnetic disk storage or other magnetic storage devices, or any other hardware storage device(s) which can be used to store program code in the form of computer-executable instructions or data structures, which can be accessed and executed by a general-purpose or special-purpose computer system to implement the disclosed functionality of the invention.
Transmission media can include a network and/or data links which can be used to carry program code in the form of computer-executable instructions or data structures, and which can be accessed by a general-purpose or special-purpose computer system. A “network” is defined as one or more data links that enable the transport of electronic data between computer systems and/or modules and/or other electronic devices. When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or a combination of hardwired or wireless) to a computer system, the computer system may view the connection as transmission media. Combinations of the above should also be included within the scope of computer-readable media.
Further, upon reaching various computer system components, program code in the form of computer-executable instructions or data structures can be transferred automatically from transmission media to computer storage media (or vice versa). For example, computer-executable instructions or data structures received over a network or data link can be buffered in RAM within a network interface module (e.g., a “NIC”), and then eventually transferred to computer system RAM and/or to less volatile computer storage media at a computer system. Thus, it should be understood that computer storage media can be included in computer system components that also (or even primarily) utilize transmission media.
Computer-executable instructions comprise, for example, instructions and data which, when executed at one or more processors, cause a general-purpose computer system, special-purpose computer system, or special-purpose processing device to perform a certain function or group of functions. Computer-executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, or even source code.
Those skilled in the art will appreciate that the invention may be practiced in network computing environments with many types of computer system configurations, including personal computers, desktop computers, laptop computers, message processors, hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, tablets, pagers, routers, switches, and the like. The invention may also be practiced in distributed system environments where local and remote computer systems, which are linked (either by hardwired data links, wireless data links, or by a combination of hardwired and wireless data links) through a network, both perform tasks. As such, in a distributed system environment, a computer system may include a plurality of constituent computer systems. In a distributed system environment, program modules may be located in both local and remote memory storage devices.
Those skilled in the art will also appreciate that the invention may be practiced in a cloud computing environment. Cloud computing environments may be distributed, although this is not required. When distributed, cloud computing environments may be distributed internationally within an organization and/or have components possessed across multiple organizations. In this description and the following claims, “cloud computing” is defined as a model for enabling on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers/nodes, storage, applications, and services). The definition of “cloud computing” is not limited to any of the other numerous advantages that can be obtained from such a model when properly deployed.
A cloud computing model can be composed of various characteristics, such as on-demand self-service, broad network access, resource pooling, rapid elasticity, measured service, and so forth. A cloud computing model may also come in the form of various service models such as, for example, Software as a Service (“SaaS”), Platform as a Service (“PaaS”), and Infrastructure as a Service (“IaaS”). The cloud computing model may also be deployed using different deployment models such as private cloud, community cloud, public cloud, hybrid cloud, and so forth.
Some embodiments, such as a cloud computing environment, may comprise a system that includes one or more hosts that are each capable of running one or more virtual machines. During operation, virtual machines emulate an operational computing system, supporting an operating system and perhaps one or more other applications as well. In some embodiments, each host includes a hypervisor that emulates virtual resources for the virtual machines using physical resources that are abstracted from view of the virtual machines. The hypervisor also provides proper isolation between the virtual machines. Thus, from the perspective of any given virtual machine, the hypervisor provides the illusion that the virtual machine is interfacing with a physical resource, even though the virtual machine only interfaces with the appearance (e.g., a virtual resource) of a physical resource. Examples of physical resources include processing capacity, memory, disk space, network bandwidth, media drives, and so forth.
The present invention may be embodied in other specific forms without departing from its essential characteristics. Such embodiments may include a data processing device comprising means for carrying out one or more of the methods described herein; a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out one or more of the methods described herein; and/or a computer-readable medium (computer storage media) comprising instructions which, when executed by a computer, cause the computer to carry out one or more of the methods described herein. The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope. When introducing elements in the appended claims, the articles “a,” “an,” “the,” and “said” are intended to mean there are one or more of the elements. The terms “comprising,” “including,” and “having” are intended to be inclusive and mean that there may be additional elements other than the listed elements. Unless otherwise specified, the terms “set,” “superset,” and “subset” are intended to exclude an empty set, and thus “set” is defined as a non-empty set, “superset” is defined as a non-empty superset, and “subset” is defined as a non-empty subset. Unless otherwise specified, the term “subset” excludes the entirety of its superset (i.e., the superset contains at least one item not included in the subset). Unless otherwise specified, a “superset” can include at least one additional element, and a “subset” can exclude at least one element.
Foreign application priority data: LU500944, Dec 2021, Luxembourg (national).
PCT filing data: PCT/US2022/080508, filed 11/28/2022, WO.