The present invention relates to cluster computing environments in which a plurality of computational resources or different cluster machines collaboratively execute computational workloads, and in particular, to methods and systems which use machine learning for allocating the computational resources in the cluster computing environment.
Placement of workloads in a cluster is a generic, well-known, and classic problem: given a cluster of machines with workloads already running on them, which machine should a new workload be scheduled on for optimal performance? Simple solutions include round-robin assignment to machines, or choosing the machine with the lowest number of currently running jobs. However, correct placement can be complicated, because different workloads tend to have different effects on each other (the “noisy neighbor effect”): a good example is two workloads that are both I/O-heavy, each of which can achieve far less than 50% of its stand-alone performance when the two are run together, due to a phenomenon known as thrashing.
More elaborate solutions require prior knowledge about the running workloads, e.g., which workloads are running on each machine and what is the new workload that has to be allocated. Making this information available requires detailed knowledge about running applications, and solutions for performing resource allocation need to rely on hand-crafted heuristics that are tuned for the target workloads, hardware and applications.
U.S. Pat. No. 9,959,146 B2 describes a method of scheduling workloads to computing resources of a data center which predicts operating values of the computing resources for use in the scheduling.
In an embodiment, the present invention provides a method for allocating a workload to at least one cluster machine of a plurality of cluster machines which are part of a computer cluster operating in a cluster computing environment. Values from hardware performance counters of each of the cluster machines are collected while the cluster machines are running different workloads. A value of a hardware performance counter from a system which executed the workload to be allocated in isolation and the values from the hardware performance counters of each of the cluster machines which are running the different workloads are used as input to a machine learning algorithm trained to provide as output in each case a prediction of a performance of the workload on each of the cluster machines which are running the different workloads. The at least one cluster machine is selected for placement of the workload based on the predictions.
The present invention will be described in even greater detail below based on the exemplary figures. The invention is not limited to the exemplary embodiments. All features described and/or illustrated herein can be used alone or combined in different combinations in embodiments of the invention. The features and advantages of various embodiments of the present invention will become apparent by reading the following detailed description with reference to the attached drawings which illustrate the following:
Embodiments of the present invention solve the resource allocation problem in a computer cluster using a machine learning method that combines information from hardware performance counters of the computers in the cluster to make an application placement decision, with no additional prior knowledge about application details and the running workloads on the target computers.
Embodiments of the present invention further provide a more automated and effective approach for solving the resource allocation problem in a computer cluster. Preferably, the approach includes continuously collecting data from all machines in the cluster to characterize the load on the system. To keep this overhead low, an embodiment of the present invention focuses on hardware performance counters, which can be collected, at very little overhead, with hardware support in commodity central processing units (CPUs), such as those implementing the x86 instruction set architecture. In addition, a workload is profiled once on its own to collect the same performance counters. This can happen during development time or once when allocating a new application or workload.
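By way of a non-limiting illustration, the following sketch shows one possible way such hardware performance counters could be sampled on a Linux machine using the standard perf tool; the chosen events, the system-wide sampling window, and the Python wrapper are illustrative assumptions and are not prescribed by the embodiments.

```python
# Illustrative only: sample a few hardware performance counters system-wide
# on Linux via "perf stat". Event names and sampling window are example choices.
import subprocess

PERF_EVENTS = ["instructions", "cycles", "cache-misses", "branch-misses"]

def sample_counters(duration_s: float = 1.0) -> dict:
    """Sample system-wide hardware performance counters for duration_s seconds."""
    cmd = [
        "perf", "stat", "-a",            # -a: all CPUs (whole machine)
        "-e", ",".join(PERF_EVENTS),
        "-x", ",",                       # CSV output for easy parsing
        "sleep", str(duration_s),
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    counters = {}
    for line in result.stderr.splitlines():   # perf stat writes its report to stderr
        fields = line.split(",")
        if len(fields) >= 3 and fields[2] in PERF_EVENTS:
            try:
                counters[fields[2]] = float(fields[0])
            except ValueError:
                counters[fields[2]] = 0.0      # event not counted / not supported
    return counters
```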
According to an embodiment, the present invention then takes the hardware performance counters data of the workload running alone, as well as the data from each cluster machine, to execute a placement decision that optimizes the performance (e.g., decided by an arbitrary, user-defined key performance indicator (KPI)) for this new workload. To predict how well a given workload would run on a given machine, a machine learning algorithm, based on artificial neural networks, is used.
In an embodiment, the present invention provides a method for allocating a workload to at least one cluster machine of a plurality of cluster machines which are part of a computer cluster operating in a cluster computing environment. Values from hardware performance counters of each of the cluster machines are collected while the cluster machines are running different workloads. A value of a hardware performance counter from a system which executed the workload to be allocated in isolation and the values from the hardware performance counters of each of the cluster machines which are running the different workloads are used as input to a machine learning algorithm trained to provide as output in each case a prediction of a performance of the workload on each of the cluster machines which are running the different workloads. The at least one cluster machine is selected for placement of the workload based on the predictions.
In a same or different embodiment, the machine learning algorithm is trained to provide as the output in each case a KPI worsening factor representing a ratio of an expected KPI of the workload on each of the cluster machines which are running the different workloads and a measured KPI from the system which executed the workload to be allocated in isolation. The at least one cluster machine predicted to have the lowest KPI worsening factor is selected for placement of the workload.
In a same or different embodiment, the input to the machine learning algorithm further includes in each case a number of the different workloads currently running on each of the cluster machines.
In a same or different embodiment, the machine learning algorithm uses an artificial neural network as a machine learning model.
In a same or different embodiment, the values of the hardware performance counters are combined with each other.
In a same or different embodiment, the method further comprises: executing the workload after placement of the workload on the at least one cluster machine concurrently with at least one other workload; collecting values from the hardware performance counters of the at least one cluster machine and measuring a KPI while the at least one cluster machine executes the workload concurrently with the at least one other workload; and using the values of the hardware performance counters of the at least one cluster machine and the measured KPI as training data for the machine learning algorithm.
In a same or different embodiment, the machine learning algorithm follows construction rules of a multi-layer perceptron.
In a same or different embodiment, the method further comprises executing a new workload in the system in isolation, or in another system or one of the cluster machines in isolation, and collecting values from the hardware performance counters during execution of the new workload.
In a same or different embodiment, the method further comprises receiving a user-specified KPI characterizing the performance of the workload.
In another embodiment, the present invention provides a system for allocating a workload to at least one cluster machine of a plurality of cluster machines which are part of a computer cluster operating in a cluster computing environment. The system comprises memory and one or more computer processors which, alone or in combination, are configured to provide for execution of a method comprising: collecting values from hardware performance counters of each of the cluster machines while the cluster machines are running different workloads; using a value of a hardware performance counter from a system which executed the workload to be allocated in isolation and the values from the hardware performance counters of each of the cluster machines which are running the different workloads as input to a machine learning algorithm trained to provide as output in each case a prediction of a performance of the workload on each of the cluster machines which are running the different workloads; and selecting the at least one cluster machine for placement of the workload based on the predictions.
In a same or different embodiment, the machine learning algorithm is trained to provide as the output in each case a KPI worsening factor representing a ratio of an expected KPI of the workload on each of the cluster machines which are running the different workloads and a measured KPI from the system which executed the workload to be allocated in isolation. The at least one cluster machine predicted to have the lowest KPI worsening factor is selected for placement of the workload.
In a further embodiment, the present invention provides a tangible, non-transitory computer-readable medium having instructions thereon which, upon execution by one or more processors with access to memory, provides for execution of the method for allocating a workload according to an embodiment of the invention.
In an even further embodiment, the present invention provides a method for training a machine learning algorithm for use in allocating workloads to cluster machines which are part of a computer cluster operating in a cluster computing environment. Values from hardware performance counters of each of the cluster machines which are running a first workload concurrently with other workloads are collected, and a KPI is measured while the cluster machines are running the first workload. The values from the hardware performance counters of each of the cluster machines are combined in each case with a value from a hardware performance counter of a system that executed the first workload in isolation. The combined values from the hardware performance counters are provided in each case as input to the machine learning algorithm, and the measured KPI is used for the output labels, such that the machine learning algorithm adapts its weights and parameters based thereon.
In a same or different embodiment, the output labels are in each case a KPI worsening factor representing a ratio of the measured KPI of the first workload on each of the cluster machines and a measured KPI from the system which executed the first workload in isolation.
In another embodiment, the present invention provides a tangible, non-transitory computer-readable medium having instructions thereon which, upon execution by one or more processors with access to memory, provides for execution of the method for training the machine learning algorithm according to an embodiment of the invention.
According to an embodiment schematically illustrated by the prediction system 20 of
The prediction system 20 can be bootstrapped by starting with an empty model with no information and using a preexisting standard algorithm (such as round-robin deployment) for placement of workloads. As data is collected about the behavior of co-located workloads, the model used by the machine learning algorithm 40 of step S3 can be trained with this data and then be used to make placement decisions, with the predictions becoming more accurate as the results of placement decisions are collected and fed into the model for regular retraining.
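A minimal sketch of this bootstrapping behavior is given below, assuming Python; the class name, the round-robin fallback, the fixed sample threshold, and the best_machine helper are hypothetical illustrations rather than a prescribed implementation.

```python
import itertools

class Scheduler:
    """Hypothetical scheduler wrapper: falls back to round-robin until the
    prediction model has been trained on enough co-location observations."""

    def __init__(self, machines, model=None, min_training_samples=100):
        self.machines = machines
        self.model = model                     # e.g., machine learning algorithm 40
        self.min_training_samples = min_training_samples
        self.samples_seen = 0                  # incremented as training data arrives
        self._rr = itertools.cycle(machines)   # legacy round-robin iterator

    def place(self, workload_profile, cluster_state):
        if self.model is None or self.samples_seen < self.min_training_samples:
            return next(self._rr)              # bootstrap: preexisting standard algorithm
        # otherwise use the trained model (see the prediction sketch further below)
        return self.model.best_machine(workload_profile, cluster_state)
```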
In contrast to U.S. Pat. No. 9,959,146 B2, embodiments of the present invention are able to predict the performance of a workload when placed on different cluster machines, and are then able to decide where to place it based on that prediction. U.S. Pat. No. 9,959,146 B2 does not describe any way to predict or optimize the performance of a workload, but rather describes a way to predict how a workload will impact overall load on different machines. Thus, U.S. Pat. No. 9,959,146 B2 takes a machine-centered view, which provides for scheduling workloads in a manner that does not overload the machines, while embodiments of the present invention take a workload-centered view and optimize the performance of that workload. Accordingly, embodiments of the present invention advantageously provide for the allocation of computer resources in a manner that enhances system performance: the placement decisions made in accordance with embodiments of the present invention provide for better performance of incoming workloads, thereby saving time, computational costs and effort, and freeing up computational resources for other workloads.
The input of the machine learning algorithm 40 includes application profiling data, such as the hardware performance counter measurements during the stand-alone execution of the application workload that is to be placed, and target system monitoring data, such as the current hardware performance counter measurements of the machine for which placement is to be evaluated and, optionally, the number of jobs already running on that machine.
The output of the machine learning algorithm 40 is a score that is either the raw KPI of the workload that is to be run on the machine that is currently being considered, compared to running it on a dedicated machine on its own, or is the KPI worsening factor that describes how much worse the application-specific KPI becomes if the workload is to be run on the machine that is currently being considered. By running a prediction for a new workload against the current load of all cluster machines, the machine that provides the best performance is identified.
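As a non-limiting sketch of how such an input vector and prediction target could be assembled (assuming Python/NumPy; the counter set and function names are illustrative, and the worsening-factor direction, isolated KPI divided by co-located KPI, follows the worked example given later in this description):

```python
import numpy as np

COUNTER_NAMES = ["instructions", "cycles", "cache-misses", "branch-misses"]  # example set

def build_input_vector(workload_counters, machine_counters, n_running_jobs=None):
    """Concatenate the stand-alone profile of the workload with the current
    counters of the candidate machine (and, optionally, its job count)."""
    x = [workload_counters[c] for c in COUNTER_NAMES]
    x += [machine_counters[c] for c in COUNTER_NAMES]
    if n_running_jobs is not None:
        x.append(float(n_running_jobs))
    return np.asarray(x, dtype=np.float32)

def kpi_worsening_factor(kpi_isolated, kpi_colocated):
    """E.g., 10000 requests/s alone vs. 5000 requests/s co-located -> factor 2.0."""
    return kpi_isolated / kpi_colocated
```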
Whenever a workload is being added to a machine, the continuously measured hardware performance counters of the overall machine load (comprising one or several workloads), as well as the actual KPI for the workload, are collected and form the ground truth of the actual performance of the workload in that environment. If, instead of the raw KPI, the KPI worsening factor is used, it can now be calculated by comparing to the (previously collected in step S1) KPI for the workload on a stand-alone system. Taking the (also collected in step S1) hardware performance counters for a stand-alone run, the hardware performance counters of the machine that it is co-located on with other loads, and, optionally, the number of such loads, gives the same input as for the prediction step, together with the actual KPI or KPI worsening factor that is used as the expected result (label). Thus, training of the machine learning algorithm 40 can be done on newly observed performance behavior of workload applications. In other words, the training approach creates training data by measuring the input and output of the machine learning algorithm 40, allowing the machine learning algorithm 40 to predict on the input data, and then feeding that prediction and the actual measured data back so that the machine learning algorithm 40 can update its parameters and thus gradually improve its model and predictions.
The machine learning algorithm 40 employed follows the construction rules of a multi-layer perceptron (MLP). In between the input layer (comprising, as described above, the performance counters of a workload when run on its own, the performance counters of a machine on which one or several other workloads are already running, and optionally the number of the workloads on that machine) and the output layer (that produces as output a single value, the KPI or the KPI worsening factor), there are several hidden dense layers (where each node of a layer is connected to every node at the next layer), each with a non-linear activation function.
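A minimal sketch of such an MLP, assuming Python and the PyTorch library (neither of which is mandated by the embodiments; the layer widths and the ReLU activation are likewise illustrative choices), could look as follows:

```python
import torch
import torch.nn as nn

class PlacementMLP(nn.Module):
    """Minimal multi-layer perceptron: input = stand-alone counters of the
    workload + current counters of the candidate machine (+ optional job
    count); output = a single predicted KPI or KPI worsening factor."""

    def __init__(self, n_inputs: int, hidden=(64, 64, 32)):
        super().__init__()
        layers, width = [], n_inputs
        for h in hidden:                                  # several hidden dense layers
            layers += [nn.Linear(width, h), nn.ReLU()]    # non-linear activation
            width = h
        layers.append(nn.Linear(width, 1))                # single-value output layer
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x).squeeze(-1)
```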
In addition to the training phase and prediction phase, there is the phase of characterizing a new workload running in isolation, which only occurs once for each new workload that has never been seen before by the system. During this phase, only data is collected; no training or prediction is done. The data that is collected is the KPI of the application (e.g., KPI_i=10000 requests per second), and the hardware performance counters that result from running this workload on an otherwise idle machine (e.g., PC_i=x). During the training phase, when a workload A (potentially concurrently with other workloads B, C, . . . ) runs on a cluster machine 10a, the hardware performance counters of machine 10a are collected (e.g., PC_ii=y) and are combined with the hardware performance counters measured when workload A was initially run on a machine on its own (PC_i+PC_ii=x+y), thus combining input values that characterize the cluster machine 10a and its current workload with input values that characterize the general behavior of the application. The KPI reached by workload A under these conditions is also measured (e.g., KPI_ii=5000 requests per second). For training of the ML algorithm, the combined hardware performance counters PC_i and PC_ii form the input, and either the raw KPI value KPI_ii or the KPI worsening factor KPI_i/KPI_ii forms the output labels. The ML algorithm can be defined to be trained using either raw KPI values or the KPI worsening factors. Thus, the input and the correct output that the ML algorithm should produce are given, so that the ML algorithm can learn the desired output, adapt its weights and parameters, etc. In the prediction phase, the current hardware performance counters on each cluster machine 10a, 10b, 10c . . . 10n (e.g., PC_ii, PC_iii, PC_iv . . . PC_n) are collected, and each is combined with the hardware performance counters from the system on which the workload to be placed was run in isolation (PC_i). Using these inputs, the ML algorithm predicts a KPI (or KPI worsening factor) as output. According to the output, the cluster machine 10a, 10b, 10c . . . 10n on which to schedule the workload is selected (e.g., the one that shows the highest predicted KPI or the lowest KPI worsening factor). The workload placement decision can thereby be based on a proper prediction, since both parts of the input are available before the workload placement decision is made.
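The prediction phase could, for instance, be sketched as follows (assuming the build_input_vector helper and PlacementMLP model from the sketches above; the layout of cluster_state is a hypothetical illustration):

```python
import torch

def select_machine(model, workload_counters, cluster_state):
    """cluster_state: dict mapping machine id -> (machine_counters, n_jobs).
    Returns the machine with the lowest predicted KPI worsening factor
    (if the model was trained on raw KPIs, select the maximum instead)."""
    model.eval()
    predictions = {}
    with torch.no_grad():
        for machine_id, (machine_counters, n_jobs) in cluster_state.items():
            x = build_input_vector(workload_counters, machine_counters, n_jobs)
            predictions[machine_id] = model(torch.from_numpy(x)).item()
    return min(predictions, key=predictions.get), predictions
```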
The three phases can run concurrently or at different times. For example, data can be collected for new workloads, while training and prediction is taking place for other workloads. Further, the training and prediction phases can work together to produce online training. For example, a workload A is placed onto a cluster machine 10a according to the workload placement decision in the prediction phase. After it is placed, the KPI is measured, as well as the hardware performance counters of the cluster machine 10a which can be, for example, also concurrently running other workloads B and C. Those two values can then be used to create another input vector for the training phase by combining the measured hardware performance counters (e.g., PC_ii) with the hardware performance counters from when workload A was run on its own (e.g., PC_i), and the measured KPI of the workloads A, B and C running together can be used for the output in the training phase (e.g., KPI_ii).
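A corresponding online-training step could, purely as an illustration, look as follows (assuming the sketches above and an optimizer such as torch.optim.Adam(model.parameters(), lr=1e-3); the single-sample update, the MSE loss, and the worsening-factor label are assumptions, not requirements):

```python
import torch
import torch.nn as nn

def online_training_step(model, optimizer, workload_counters, machine_counters,
                         n_jobs, kpi_isolated, kpi_measured):
    """After workload A has been placed and run co-located with other workloads,
    feed the newly observed sample back into the model: one gradient step on the
    KPI worsening factor (batching / replay buffers are omitted for brevity)."""
    x = torch.from_numpy(build_input_vector(workload_counters, machine_counters, n_jobs))
    target = torch.tensor(kpi_isolated / kpi_measured)   # label, e.g. 10000/5000 = 2.0
    model.train()
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), target)
    loss.backward()
    optimizer.step()
    return loss.item()
```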
Embodiments of the present invention provide for the following improvements and advantages:
a. Providing, according to one particular embodiment, the number of applications being run on the target machine as an additional input to the machine learning model.
b. Using an artificial neural network as a machine learning model.
c. Using performance counters that include CPU hardware counters.
d. Providing that the target of the prediction is the ratio between the expected performance of the application when running together with the target machine's workload and the performance of the application when running alone on a machine, or providing that the target of the prediction is the KPI value of the target application when running together with the target machine's workload.
According to an embodiment, the present invention provides a method comprising the following steps:
The quality of placement decisions can be improved with the amount of previously collected data about the behavior of co-located workloads. During an initial training phase necessary to create models, a legacy algorithm can be used.
Embodiments of the present invention could be deployed in Cloud and system platform markets, where more efficient placement decisions can reduce the amount of wasted resources and give customers higher performance and faster execution of their workloads.
While the invention has been illustrated and described in detail in the drawings and foregoing description, such illustration and description are to be considered illustrative or exemplary and not restrictive. It will be understood that changes and modifications may be made by those of ordinary skill within the scope of the following claims. In particular, the present invention covers further embodiments with any combination of features from different embodiments described above and below. Additionally, statements made herein characterizing the invention refer to an embodiment of the invention and not necessarily all embodiments.
The terms used in the claims should be construed to have the broadest reasonable interpretation consistent with the foregoing description. For example, the use of the article “a” or “the” in introducing an element should not be interpreted as being exclusive of a plurality of elements. Likewise, the recitation of “or” should be interpreted as being inclusive, such that the recitation of “A or B” is not exclusive of “A and B,” unless it is clear from the context or the foregoing description that only one of A and B is intended. Further, the recitation of “at least one of A, B and C” should be interpreted as one or more of a group of elements consisting of A, B and C, and should not be interpreted as requiring at least one of each of the listed elements A, B and C, regardless of whether A, B and C are related as categories or otherwise. Moreover, the recitation of “A, B and/or C” or “at least one of A, B or C” should be interpreted as including any singular entity from the listed elements, e.g., A, any subset from the listed elements, e.g., A and B, or the entire list of elements A, B and C.
Priority is claimed to U.S. Provisional Application No. 62/793,393 filed on Jan. 17, 2019, the entire contents of which is hereby incorporated by reference herein.