The present application claims the benefit of priority from Indian Provisional Application No. 202241036351 filed Jun. 24, 2022 and entitled “SYSTEM AND METHODS FOR DYNAMIC WORKLOAD MIGRATION AND SERVICE UTILIZATION BASED ON MULTIPLE CONSTRAINTS,” the disclosure of which is incorporated by reference herein in its entirety.
The present invention relates generally to management of computing environments and more specifically to dynamic migration of workloads between computing environments.
Amazon Web Services (AWS) and other cloud platform and computing environment service providers offer users workload migration capabilities, such as enabling migration of containers based on predefined time schedules or in response to events. However, such migration is static in the sense that when processing is initiated at the predetermined time or upon the occurrence of an event the processing is fixed (i.e., no migration of running processes is permitted). Thus, for example, an event may be defined and a rule configured that specifies when the event occurs a job should begin processing. Once the event occurs, the event serves as a trigger to begin processing per the corresponding rule, but that job cannot be modified or moved once processing begins.
Systems and methods supporting dynamic migration of workloads and workload processing between different execution environments are disclosed. The disclosed systems and methods provide functionality for monitoring execution environments to verify availability of sufficient computing resources, data residency constraints, renewable energy utilization, and other execution environment metrics (e.g., costs, carbon intensity or footprint, etc.). A job (e.g., processing of a workload) may be initiated at a particular execution environment (e.g., cloud platform, etc.) based on the monitoring, such as to initiate the job at an execution environment determined to be optimal with respect to one or more metrics.
The monitoring may continue after the job is initiated. At a subsequent time, while the job is still in progress, a determination may be made to migrate the job (e.g., processing of the workload) to a different execution environment (e.g., a second execution environment) that provides a more optimal configuration for the job with respect to the one or more metrics (e.g., optimized carbon impact, cost, etc.). When a determination to migrate to the second execution environment is made, operations may be initiated to configure the second execution environment to take over the job. After initialization, the second execution environment may be monitored for a stabilization period, which is configured to ensure the second execution environment is in a stable state prior to starting the migration of the job. After stabilization of the second execution environment is confirmed, the job may be migrated from the first execution environment to the second execution environment, at which time processing of the workload switches from the first to the second execution environment.
In some aspects, migrating from one execution environment to another may involve evaluating multiple available execution environment options, as opposed to merely switching between two execution environments. In instances where migration involves selecting one of many possible execution environments a conflict resolution process may be utilized to determine which execution environment should be selected for the migration of the job, which may take into account at least some of the metrics associated with the monitoring or other factors.
In addition to determining when to migrate, the disclosed systems and methods also provide functionality supporting forecasting techniques for performing migration between execution environments. The forecasting techniques may leverage historical migration information to predict when future migrations may occur or be advantageous. The ability to leverage such forecasting techniques may enable migrations to occur more efficiently and in the absence of any ability to observe metrics for migration analysis based on the monitoring.
For a more complete understanding of the present invention, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
It should be understood that the drawings are not necessarily to scale and that the disclosed embodiments are sometimes illustrated diagrammatically and in partial views. In certain instances, details which are not necessary for an understanding of the disclosed methods and apparatuses or which render other details difficult to perceive may have been omitted. It should be understood, of course, that this disclosure is not limited to the particular embodiments illustrated herein.
The present disclosure provides systems and methods supporting dynamic migration of workloads or containers between execution environments. The disclosed systems and methods may utilize monitoring and/or forecasting techniques to determine when a migration should occur. Upon determining a migration should occur, a target execution environment for a workload may be identified and a migration process may be initiated. In some aspects, the migration may be performed partway through processing of the workload and the migration may resume processing the workload after the migration is completed in a manner that enables the processing to resume at the point where processing stopped prior to the migration.
Referring to
Each of the one or more processors 112 may be a central processing unit (CPU) or other computing circuitry (e.g., a microcontroller, one or more application specific integrated circuits (ASICs), and the like) and may have one or more processing cores. The memory 114 may include read only memory (ROM) devices, random access memory (RAM) devices, one or more hard disk drives (HDDs), flash memory devices, solid state drives (SSDs), network attached storage (NAS) devices, other devices configured to store data in a persistent or non-persistent state, or a combination of different memory devices. The memory 114 may store instructions 116 that, when executed by the one or more processors 112, cause the one or more processors 112 to perform the operations described in connection with the migration device 110 with reference to
The one or more communication interfaces 128 may be configured to communicatively couple the migration device 110 to the one or more networks 170 via wired or wireless communication links according to one or more communication protocols or standards (e.g., an Ethernet protocol, a transmission control protocol/internet protocol (TCP/IP), an Institute of Electrical and Electronics Engineers (IEEE) 802.11 protocol, an IEEE 802.16 protocol, a 3rd Generation (3G) communication standard, a 4th Generation (4G)/long term evolution (LTE) communication standard, a 5th Generation (5G) communication standard, and the like). The I/O devices 130 may include one or more display devices, a keyboard, a stylus, one or more touchscreens, a mouse, a trackpad, a camera, one or more speakers, haptic feedback devices, or other types of devices that enable a user to receive information from or provide information to the migration device 110. It is noted that while shown in
The migration engine 120 provides functionality supporting acquisition of migration parameters and constraints that may be used to control how migration is performed, when migration is performed, or other operations for controlling migration between different execution environments. The one or more sensors 122 may be configured to monitor different execution environment parameters, a status of in-progress workload processing of jobs, or other types of parameters or constraints that may be used to control migration operations. The monitoring engine 124 may be configured to monitor the sensor(s) 122 and communicate information associated with data collected by the sensor(s) 122 to the migration engine 120, such as information that may be used to determine whether to initiate a migration of a workload from an execution environment 150 to a different execution environment, such as execution environment 160 or execution environment 174. The request handler 126 may be configured to receive incoming jobs (e.g., workload processing requests) and may queue each received job for processing at one of the available execution environments.
As an illustrative and non-limiting example and referring to
As illustrated in
Once the second execution environment is determined to be in the stable state, at time (t)=4, the migration may be initiated. In an aspect, the migration may involve saving a state of the workload processing and then transferring state information to the second execution environment to enable the workload processing to be resumed in the second execution environment starting from the same point where workload processing stopped in the first execution environment. In another example, the workload processing may be restarted in the second execution environment. In such implementations, the threshold change required to initiate migration may be higher where the workload processing is restarted as compared to merely resumed from the point processing was stopped in the first execution environment in order to ensure that the benefit provided by the migration is not outweighed by the redundant processing required when the workload processing is restarted. Once the second execution environment is initialized and the migration is complete, at time (t)=5, the workload processing may be executed on the second execution environment and resources in the first execution environment may be freed up for other tasks or may become idle.
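As an illustrative and non-limiting sketch, the difference between resuming from transferred state and restarting from the beginning may be expressed as two thresholds, with the restart threshold set higher to offset the redundant processing. The names `MigrationPolicy`, `should_migrate`, and `starting_point`, as well as the threshold values, are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass
class MigrationPolicy:
    # Hypothetical values; the disclosure does not specify these thresholds.
    resume_threshold: float = 0.10   # required metric gain when state is transferred
    restart_threshold: float = 0.25  # higher bar when completed work must be redone


def should_migrate(current_metric: float, candidate_metric: float,
                   can_resume: bool, policy: MigrationPolicy) -> bool:
    """Return True when the candidate's gain clears the applicable threshold."""
    gain = candidate_metric - current_metric
    required = policy.resume_threshold if can_resume else policy.restart_threshold
    return gain >= required


def starting_point(saved_progress: float, can_resume: bool) -> float:
    """Progress at which processing begins in the second execution environment."""
    # Resuming continues from the saved state; restarting discards it.
    return saved_progress if can_resume else 0.0
```

In this sketch, a gain of 0.15 in the monitored metric would justify a migration that resumes from saved state but not one that restarts the workload.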
As shown in
Referring back to
For example, in some implementations there may be many potential execution environments suitable for processing a particular workload, each having particular metrics that the migration engine 120 of
In addition to the above-described functionality, which is based on monitoring the various execution environments, in some implementations the migration device 110 may also provide functionality for forecasting migrations. The forecasting operations may utilize machine learning techniques to predict or forecast when migrations would be beneficial. For example, historical migration data may indicate that migration between a first and second execution environment frequently occurs on a particular day, at a particular time, during a particular season, or according to some other criterion. Such observations may then be used to predict when a particular migration should occur and/or to schedule the migration. In some instances, such historical migration data may be stored in and/or retrieved from the one or more databases 118, which may include a historical database maintaining values for metrics of interest observed over time.
In an aspect, clustering techniques may be leveraged to identify such migration operations. For example, historical data for a time period (e.g., the last month, last week, last X days, etc.) may be analyzed using a clustering algorithm to identify optimal migrations for workloads and/or service requests. As a result, future migrations may be predicted based on optimal performance of available execution environments according to one or more clusters, each identifying an optimal execution environment for processing a particular type of workload or service request. To illustrate, migration data for a previous 6 days for training of artificial intelligence workloads may be analyzed and 2 clusters may be generated. Each of the 2 clusters may predict an optimal execution environment for training of artificial intelligence workloads on the 7th day, which may indicate training of artificial intelligence workloads should be migrated to a first execution environment at a first time on the 7th day and then migrated to a second execution environment at a second time on the 7th day to obtain optimal processing performance.
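As an illustrative and non-limiting sketch of the clustering approach, historical migration records may be grouped by time of day and each cluster mapped to the execution environment most often chosen within it. The disclosure does not name a particular clustering algorithm; the small one-dimensional k-means below, the function name `forecast_migrations`, and the (hour, environment) record format are assumptions made for illustration only.

```python
from collections import Counter
from statistics import mean


def forecast_migrations(history, k=2, iterations=20):
    """Cluster (hour, environment) records by hour with a small 1-D k-means.

    Returns (predicted_hour, environment) pairs sorted by hour. Assumes
    `history` is non-empty and `iterations` is at least 1.
    """
    hours = sorted(h for h, _ in history)
    # Deterministic initial centroids spread evenly across the observed hours.
    step = (len(hours) - 1) / max(k - 1, 1)
    centroids = [hours[round(i * step)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for hour, env in history:
            nearest = min(range(k), key=lambda i: abs(hour - centroids[i]))
            groups[nearest].append((hour, env))
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [mean(h for h, _ in g) if g else centroids[i]
                     for i, g in enumerate(groups)]
    forecast = []
    for centroid, group in zip(centroids, groups):
        if group:
            # The environment chosen most often in the cluster is the prediction.
            env = Counter(e for _, e in group).most_common(1)[0][0]
            forecast.append((centroid, env))
    return sorted(forecast)
```

Given six days of records in which one environment was selected in the morning and another in the evening, the two resulting clusters would suggest a morning migration and an evening migration on the seventh day.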
The ability to use the above-described forecasting techniques to predict when migrations should occur, what target environments should be chosen for the migrations, or other migration parameters may enable the migration device 110 to operate without the sensors 122 and/or without monitoring the various environments, or at least not monitoring them as frequently, thereby providing a more independent system for managing migration between different execution environments. Although the migration device 110 may be capable of operating without monitoring where forecasting techniques are used, it is noted that in some implementations monitoring may be used in addition to the forecasting techniques, which may improve the results achieved for forecasted migrations because the monitoring enriches the datasets used for forecasting.
Referring to
However, the change as to the third execution environment may exceed the threshold during the stabilization period, thereby establishing the third execution environment as another candidate execution environment. In such a situation, a conflict is presented whereby two alternative execution environments are viable candidates for use in a migration from the first execution environment. To resolve this conflict the modelling engine may reevaluate the second and third execution environments after the stabilization phase is completed to identify whether the second or third execution environment should be chosen for the migration from the first execution environment. As can be seen in
As shown in Table 2, at time t=0, two execution environments (e.g., Alt1 and Alt2) may be monitored and Alt1 may be ranked higher than Alt2 due to Alt1 having a higher monitored metric (e.g., a higher percentage of green energy utilization, lower carbon intensity, etc.). As such, at time t=0 Alt1 may be the preferred execution environment. At time t=1 Alt2 may exhibit an improved monitored metric, resulting in Alt2 being the higher ranked execution environment and Alt1 becoming the lower ranked execution environment. In an aspect, the difference in the performance metric between Alt1 and Alt2 at time t=1 may be below a migration performance metric and so migration from Alt1 to Alt2 may not be initiated (e.g., because the performance improvement may be insufficient to justify migration). For example, the migration performance metric may specify a threshold performance increase (e.g., 5%, 10%, 15%, 20%, etc.). As an additional or alternative example, the migration performance metric may specify that migration should not occur if a current workload or service request has reached a threshold completion level (e.g., 80%, 85%, 90%, 95%, etc.) such that processing of the workload or service request may be more efficiently completed (e.g., from a processing and computational resources perspective) on a current execution environment rather than being migrated. It is noted that in some aspects, multiple migration performance metrics may be considered, such as those described above or other metrics, when determining whether to migrate to a new execution environment.
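As an illustrative and non-limiting sketch, the two migration performance metrics described above (a threshold performance increase and a threshold completion level) may be combined into a single check. The function name `meets_migration_metrics` and the default values of 10% and 90% are hypothetical choices drawn from the example ranges above.

```python
def meets_migration_metrics(current_score, candidate_score, completion,
                            min_improvement=0.10, max_completion=0.90):
    """Return True when a candidate environment justifies starting a migration.

    current_score / candidate_score: monitored metric where higher is better.
    completion: fraction of the job already processed, from 0.0 to 1.0.
    """
    # A job that is nearly finished is cheaper to complete where it is.
    if completion >= max_completion:
        return False
    # Guard against division by zero when the current score is not positive.
    if current_score <= 0:
        return candidate_score > 0
    improvement = (candidate_score - current_score) / current_score
    return improvement >= min_improvement
```

Under these assumed defaults, a candidate that improves the metric from 50 to 52 would not trigger migration, a candidate that improves it from 50 to 60 would, and neither would trigger migration for a job that is already 95% complete.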
At time t=2, a third execution environment (e.g., Alt3) may begin being monitored and may be the third ranked execution environment. Additionally, at time t=2 Alt2 may satisfy the migration performance metric(s), which may initiate a stabilization period to determine whether the performance metric(s) of Alt2 is stable (i.e., not a temporary occurrence). At time t=3, Alt3 may overtake Alt1 (i.e., the current execution environment) to become the second ranked execution environment and may satisfy the migration performance metric, which may initiate a stabilization period for Alt3. At time t=4 Alt3 may become the highest ranked execution environment and Alt2 may complete the stabilization period. As described herein, when the stabilization period ends, and assuming the performance is stable, a stabilization grace period may begin, which is a period of time to allow conflict resolution to occur. In the example shown in Table 2 above, the conflict being resolved during the stabilization grace period may be the conflict between Alt2 having completed the stabilization period and Alt3 being the highest ranked execution environment while still in its stabilization period. At time t=5, Alt3 completes the stabilization period and remains the highest ranked execution environment. At time t=6, the stabilization grace period may end and Alt3 may remain the highest ranked execution environment. As a result, the workload or service request may be migrated to Alt3.
At time t=7, migration to Alt3 may be completed and a cooldown period may be initiated. As explained above, the cooldown period may be a period of time in which no migrations are performed to avoid migrating too often, which may be inefficient. At time t=8, the cooldown period may end and the system may begin determining whether to migrate the workload or service request from Alt3. However, since Alt3 remains the highest ranked execution environment at time t=8 no migration operations may be initiated. As shown in the example above, embodiments of the present disclosure may enable conflict resolution processing to be performed to account for conflicts that may arise during migration determinations and may enable the most efficient processing of workloads despite the occurrence of a conflict.
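As an illustrative and non-limiting sketch, the stabilization period, stabilization grace period, and cooldown period described above may be modeled as a small controller that is called once per monitoring interval. The class name `MigrationController`, its fields, and the period lengths are hypothetical; the sketch also assumes the ranking passed to `tick` lists every monitored execution environment.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MigrationController:
    current: str
    stabilization_steps: int = 2  # hypothetical durations, in monitoring ticks
    grace_steps: int = 2
    cooldown_steps: int = 1
    qualified_at: dict = field(default_factory=dict)  # env -> tick it first qualified
    grace_ends: Optional[int] = None
    cooldown_until: int = -1

    def tick(self, t, qualifying, ranking):
        """Process one monitoring tick; return the migration target or None."""
        if t < self.cooldown_until:
            return None  # no migrations are evaluated during cooldown
        # Start stabilization for new candidates; drop any that stopped qualifying.
        self.qualified_at = {env: self.qualified_at.get(env, t)
                             for env in qualifying if env != self.current}
        stable = [env for env, start in self.qualified_at.items()
                  if t - start >= self.stabilization_steps]
        if not stable:
            self.grace_ends = None
            return None
        if self.grace_ends is None:
            # First candidate has stabilized: open the grace period.
            self.grace_ends = t + self.grace_steps
        if t < self.grace_ends:
            return None  # allow other candidates time to finish stabilizing
        # Conflict resolution: pick the highest ranked stabilized candidate.
        target = min(stable, key=ranking.index)
        self.current = target
        self.qualified_at = {}
        self.grace_ends = None
        self.cooldown_until = t + 1 + self.cooldown_steps
        return target
```

With these assumed period lengths, feeding the controller the sequence described above (Alt2 qualifying from t=2 and Alt3 from t=3) yields a single migration to Alt3 at t=6, followed by a cooldown that ends at t=8.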
Referring to
If attempts to retry or reinitialize processing of the workload on the first execution environment according to the processing recovery parameter(s) fail, the workload (or service request) may be migrated to the next highest ranked execution environment in accordance with the techniques described herein, such as the third execution environment at time (t)=3 in this example. If the workload is interrupted or fails after or during migration to the third execution environment, such as at time (t)=4, attempts to resume or reinitialize the processing may be performed according to one or more processing recovery parameters as described above. However, if such attempts are unsuccessful, a determination to migrate the workload (or service request) to the first or second execution environments may be performed. As described herein, such a determination may be based on a ranking of the execution environments for migration suitability using one or more migration metrics.
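As an illustrative and non-limiting sketch, the recovery flow described above may be expressed as a loop that retries processing on the current execution environment before falling back to the next highest ranked one. The function name `run_with_recovery` and the `max_retries` parameter are hypothetical stand-ins for the processing recovery parameter(s), and the job is assumed to signal an interruption or failure by raising `RuntimeError`.

```python
def run_with_recovery(job, ranked_envs, max_retries=2):
    """Try each environment in rank order, retrying before falling back.

    job: callable taking an environment name; raises RuntimeError on
         interruption or failure (an assumption of this sketch).
    ranked_envs: environment names ordered from most to least suitable.
    Returns (environment, result) for the first successful attempt.
    """
    for env in ranked_envs:
        # One initial attempt plus up to max_retries recovery attempts.
        for _ in range(1 + max_retries):
            try:
                return env, job(env)
            except RuntimeError:
                continue  # retry on the same environment
        # Recovery failed here; fall through to the next ranked environment.
    raise RuntimeError("job failed on every candidate environment")
```

In practice the ranking would likely be recomputed from the latest monitored metrics before each fallback rather than fixed in advance, as the determination described above is based on a current ranking of the execution environments.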
It is noted that in some of the examples above, the migration metrics considered when determining to migrate processing of workloads and/or service requests relate to singular metrics, such as a percentage of renewable energy. However, in some aspects, determinations to migrate processing of workloads may account for multiple different migration metrics and types of migration metrics. For example, if an execution environment is ranked highly for a particular metric (e.g., a metric representing an environment's utilization of renewable energy) but is otherwise not suitable for executing a workload, the system may determine not to migrate the workload to that environment. To illustrate, suppose that an amount of renewable energy is determined to be a preferred metric for a workload, and an execution environment ranks highly for the amount of renewable energy it consumes, but ranks poorly for other monitored metrics such as cost, reliability, efficiency, or processing time. Such an execution environment may not be desirable for migrating workloads. In an aspect, execution environments may have their failures monitored, tracked, and/or stored, such as in a database (e.g., the one or more databases 118 of
In the example shown in Table 3 above, migration suitability for three execution environments (e.g., Alt1, Alt2, Alt3) is shown over 5 time periods from t=0 to t=4. Initially, at t=0, Alt1 may be the highest ranked execution environment based on the three monitored metrics, which include cost, eviction rate (%), and renewable energy utilization (%). However, at time t=1 the cost of Alt1 may increase significantly and the cost of Alt2 may decrease approximately 20% (e.g., from $5.67 to $4.67). Based on the weightings of the various metrics, Alt2 may become the highest ranked execution environment and processing may be migrated to Alt2 using the concepts described herein (e.g., stabilization period, grace period, cooldown, etc.). At time t=2 Alt1 may again become the highest ranked execution environment based on the monitored metrics and associated rankings and processing may be migrated back to Alt1. However, the processing of the workload may fail at Alt1 and may be migrated to Alt2 following completion of the process recovery operations at Alt1.
At time t=3 the workload or service request may be evicted by Alt1. As explained herein, when the eviction occurs process recovery operations may be performed. In the example shown in Table 3, the process recovery operations may be unsuccessful and the workload or service request is migrated to Alt3. At time t=3 the workload or service request may be evicted by Alt3. As explained herein, when the eviction occurs process recovery operations may be performed. In the example shown in Table 3, the process recovery operations may be unsuccessful and the workload or service request is migrated to Alt2.
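As an illustrative and non-limiting sketch, a weighted ranking over the three monitored metrics (cost, eviction rate, and renewable energy utilization) may be computed by normalizing each metric across the execution environments and summing the weighted results. The function name `rank_environments`, the min-max normalization, the weights, and all metric values other than the $5.67 and $4.67 costs mentioned above are hypothetical and are not taken from Table 3.

```python
# Metrics for which a smaller value is preferable (assumed metric names).
LOWER_IS_BETTER = {"cost", "eviction_rate"}


def rank_environments(metrics, weights):
    """Rank environments by a weighted sum of min-max normalized metrics.

    metrics: {env: {metric_name: value}}
    weights: {metric_name: weight}
    Returns environment names ordered from highest to lowest score.
    """
    scores = {env: 0.0 for env in metrics}
    for name, weight in weights.items():
        values = [m[name] for m in metrics.values()]
        lo, hi = min(values), max(values)
        span = hi - lo
        for env, m in metrics.items():
            # Identical values carry no ranking information; use a neutral 0.5.
            norm = 0.5 if span == 0 else (m[name] - lo) / span
            if name in LOWER_IS_BETTER:
                norm = 1.0 - norm
            scores[env] += weight * norm
    return sorted(scores, key=scores.get, reverse=True)


weights = {"cost": 0.4, "eviction_rate": 0.2, "renewable": 0.4}
t0 = {
    "Alt1": {"cost": 4.00, "eviction_rate": 5, "renewable": 80},
    "Alt2": {"cost": 5.67, "eviction_rate": 10, "renewable": 60},
    "Alt3": {"cost": 6.00, "eviction_rate": 20, "renewable": 40},
}
# Hypothetical t=1 snapshot: Alt1 becomes expensive and Alt2 becomes cheaper.
t1 = {**t0,
      "Alt1": {**t0["Alt1"], "cost": 9.00},
      "Alt2": {**t0["Alt2"], "cost": 4.67}}
```

With these assumed values, Alt1 ranks first for the `t0` snapshot and Alt2 ranks first for the `t1` snapshot, mirroring the change in ranking described for time t=1.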
As shown above, embodiments of the present disclosure provide robust migration capabilities that may account for multiple monitored metrics to identify an optimal execution environment and may also provide for failure recovery and other aspects of the migration and processing of workloads to ensure workloads and service requests are processed in an optimal manner. In a similar manner to the dynamic migration processes discussed above relative to
Referring to
At step 510, the method 500 includes initiating, by one or more processors, processing of a job at a first execution environment. As explained above, the job may include a workload, such as training of an artificial intelligence model, or may include a service request. At step 520, the method 500 includes monitoring, by the one or more processors, the first execution environment and a second execution environment. As explained above, the monitoring may be configured to evaluate each execution environment of a plurality of execution environments with respect to one or more metrics (e.g., utilization (%) of green or renewable energy, performance metrics such as failure or eviction rates, job processing completion percentage, etc.). At step 530, the method 500 includes determining, by the one or more processors, to migrate processing of the job to the second execution environment based at least in part on the monitoring. As described elsewhere herein, it should be appreciated that step 530 may additionally include determining whether to migrate the job to other execution environments, rather than just determining between the first and second execution environments. Additionally, it is noted that determining to migrate the processing of the job may also include other operations described herein, such as verifying the stability of an execution environment, conflict resolution, and the like. At step 540, the method 500 includes migrating, by the one or more processors, processing of the job from the first execution environment to the second execution environment. It is noted that the method 500 may include additional operations consistent with the operations described above with reference to
It is noted that the exemplary use cases described above, as well as the processes, methods, and techniques for controlling migration of processing between multiple execution environments, have been primarily described with respect to migration between 2 or 3 execution environments. However, these use cases have been described for purposes of illustrating the migration techniques disclosed herein, rather than by way of limitation, and it should be understood that the migration techniques described throughout this disclosure may be applied to any number of execution environments (e.g., 2 or more). Indeed, such techniques may become more useful, applicable, and necessary as the number of potentially viable alternative execution environments increases. For example, applying techniques like those described herein may enable more effective workload migrations by providing new capabilities to optimize where workloads and service requests are processed based on multiple optimization factors (e.g., utilization (%) of green or renewable energy by an execution environment, failure/reliability and other performance metrics, workload processing completion status, and other factors, such as processing failure recovery operations). As a result, effectiveness of processing of workloads and/or service requests may increase as an optimal execution environment for a given workload may be quickly identified from several potential execution environments, ranked, and migrated to.
The benefits of applying migration techniques such as those described in this disclosure are numerous. For example, migrating workloads to execution environments ranking highly for renewable energy use may reduce the carbon footprint of the workload. In addition to the environmental benefits, reducing a workload's carbon footprint may enable greater consumer satisfaction or peace of mind, which may provide business advantages. As another exemplary benefit, performing migrations into execution environments with greater processing capability and/or more computing resources available may improve processing and/or workload performance. This may increase processing speed, saving time and costs associated with longer runtimes. Yet another exemplary benefit of the techniques described herein may include providing more robust failure recovery capabilities. For example, as was discussed above related to
Although the present invention and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims. Moreover, the scope of the present application is not intended to be limited to the particular embodiments of the process, machine, manufacture, composition of matter, means, methods and steps described in the specification. As one of ordinary skill in the art will readily appreciate from the disclosure of the present invention, processes, machines, manufacture, compositions of matter, means, methods, or steps, presently existing or later to be developed that perform substantially the same function or achieve substantially the same result as the corresponding embodiments described herein may be utilized according to the present invention. Accordingly, the appended claims are intended to include within their scope such processes, machines, manufacture, compositions of matter, means, methods, or steps.
Number | Date | Country | Kind |
---|---|---|---|
202241036351 | Jun 2022 | IN | national |