Embodiments of the present invention generally relate to data protection and data protection operations. More particularly, at least some embodiments of the invention relate to systems, hardware, software, computer-readable media, and methods for protecting data including data replication for multiple applications.
Data protection systems can protect data in a variety of different manners. The process of protecting data generally provides the ability to back up and recover data. However, there are a variety of data protection operations and many ways to back up and recover data.
One of the concerns related to backing up data, in addition to having a valid backup, is the ability to restore the data. Recovering applications or failing over applications allows the applications to resume operation. However, the longer it takes to perform a recovery operation or a failover operation, the greater the damage to an entity.
One of the metrics used to characterize the restore operation is Recovery Time Objective (RTO). The RTO indicates a point in time at which operation may resume or the point in time at which the recovery operation is completed. Data protection operations often seek to find ways to reduce the RTO.
In order to describe the manner in which at least some of the advantages and features of the invention may be obtained, a more particular description of embodiments of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. Understanding that these drawings depict only typical embodiments of the invention and are not therefore to be considered to be limiting of its scope, embodiments of the invention will be described and explained with additional specificity and detail through the use of the accompanying drawings, in which:
Embodiments of the present invention generally relate to data protection operations including data backup and restore operations. More particularly, at least some embodiments of the invention relate to systems, hardware, software, computer-readable media, and methods for data protection and to performing data protection operations. Examples of data protection operations include, but are not limited to, backup operations, replication operations, any point in time (PiT) recovery operations, journaling operations, restore or recovery operations, disaster recovery operations, failover operations, failback operations, or the like or combination thereof. Embodiments of the invention may be implemented in a variety of different networks and configurations including physical and/or virtual systems, virtual machines (VMs), local area networks, wide area networks, cellular networks, cloud-based systems and networks including datacenters, or the like or combination thereof.
In general, example embodiments of the invention relate to systems and methods for protecting data in a manner that improves Recovery Time Objective (RTO) while also managing cost. A PiT data protection system may replicate IO (Input/Output) from a production site that is associated with or includes multiple virtual machines to a replica site that may include multiple replica virtual machines. In the PiT data protection system, when operating, the replica virtual machine disks are updated with new data that has been written to the production virtual machine disks. Access to the replica virtual machine disks is typically blocked by a software component such as a splitter to avoid inconsistencies and unwanted data changes.
In one example, the replica virtual machine may be powered off or in a “shadow” mode. In shadow mode, the virtual machine disks or volumes exist and can be written to, but the operating system (OS) is not running. Recovering from the replica virtual machine or failing over to the replica virtual machine will require powering on the virtual machine and booting the OS. This is a process that may take significant time. Even though a data protection system may protect the whole virtual machine, it is often advantageous to protect a production application, which typically runs on the production virtual machine.
When operating in shadow mode, the RTO may include the startup time of the virtual machine, the boot time of the OS, and the start time of the application. Embodiments of the invention reduce the RTO and the associated costs.
In
The replica virtual machine 112 is associated with virtual disks 114, 116, and 118. During replication, the production data written to the virtual disks 106 and 108 may be journaled in a journal 110 and written to, respectively, virtual disks 116 and 118. Thus, the applications 122 and 124 are effectively replicated to the replica virtual machine 112. Binaries of the applications 122 and 124 may be stored or installed on the virtual disk 114, which is also an OS disk for the replica virtual machine 112. The applications 122 and 124 could also be stored on the data virtual volumes or disks 116 and 118.
To improve the RTO,
In this example, an RTO of seconds rather than minutes can be achieved by removing the need to configure hardware, perform POST, boot the OS, perform network discovery and connection, log in, or the like.
In one example, virtual machines that only have a single disk can often be configured to have a supported configuration or configured with a different configuration. This may include adding at least one more virtual disk, formatting the added disk with a file system and setting the application data source to reside on the added disk. The examples discussed herein may include a single OS disk, but embodiments of the invention are not limited to a single OS disk. However, the OS and an application typically reside on a predefined fixed number of disks. These and/or the data disks can be tagged for identification purposes. Further, references to virtual disks may also include references to virtual volumes.
The data protection system of
Embodiments of the invention optimize or reduce this cost by sharing replica virtual machines with several applications. In other words, multiple applications may be replicated to the same replica virtual machine. Embodiments of the invention create a relationship between availability (e.g., in the context of a recovery operation or the like) and cost that can be balanced. Embodiments of the invention relate to static and/or dynamic strategies for optimizing the manner in which virtual machines are shared and the manner in which applications are protected. Balancing the relationship between cost and availability can impact performance when a disaster or other problem occurs.
Replicating a virtual machine or replicating an application often has the same effect. The actual data replicated may depend on implementation. However, replicating a virtual machine or replicating an application typically conveys replicating the data associated with the application. The OS and application binary may already be present on the replica virtual machines at the replica site and do not necessarily need to be replicated.
A data protection system 240 may be configured to replicate virtual disks 206, 212, and 218 (or the changes made to these virtual disks once a baseline is established) to replica virtual disks 226, 228, and 230. The virtual disk 224 may include an OS and binaries of app1, app2, and app3. Further, the OS on the virtual disk 224 may or may not be the same as the OSes on the virtual disks 204, 210, and 216. As long as a similar OS is sufficient for app1, app2, and app3, various sharing strategies can be implemented. If the OS version or type is a hard dependency, this may impact the manner in which a replica virtual machine is used for replication. More specifically, the OS may impact which of the apps and production virtual machines can share the same replica virtual machine when the OS is a hard dependency.
More generally, a typical cloud or datacenter may have large numbers of applications running on separate virtual machines. As a result, the replica virtual machines can share the applications of a production site in a large number of combinations. Some replica virtual machines, for example, may contain M applications while other replica virtual machines may contain fewer or more applications. In some instances, embodiments of the invention ensure that a replica virtual machine only has a single application replicated thereto.
Next or if the virtual machines at the replica site have been previously installed/deployed, a strategy is selected 304 for replicating production apps and/or their data to the replica virtual machine. The strategy may include a QoS strategy 306, a failure domain aware strategy 308, or a related application or inter-dependent strategy 310. In some examples, more than one strategy may be used to replicate the production virtual machines. These strategies are discussed later.
A replication engine of a data protection system (e.g., RP4VMs) may replicate 312 production data virtual disks or volumes of the M binaries stored on the replica virtual machine to corresponding disks or volumes on the replica virtual machine. The M binaries or apps may be selected according to the previously selected strategy. The apps replicated to the replica virtual machine may be in separate consistency groups by way of example only. As a result, each app may be associated with a separate any-PiT journal and may be recovered independently of other apps.
In some examples, the application and its data may be stored on the same virtual disk. If the applications are not installed together with their data, the applications 1-M may be installed on the OS disk. In some embodiments, the apps are typically turned off. If supported, the app may be in a running state which allows the replica virtual data disk to be mounted at a later time.
As a result, the M production applications are protected to one replica virtual machine and the M applications share the same replica virtual machine.
When applications share a replica virtual disk, the apps protected to the same replica virtual machine should be able to run together. For example, the apps on the replica virtual machine should use the same or similar OS, on which they can run. In addition, the applications should support the hardware (e.g., CPU architecture) with which the replica virtual machine is configured. If supported by the application, more than one instance of the application can run on the replica virtual machine. If not supported, multiple instances of the application will not be placed on the same replica virtual machine in one example.
Embodiments of the invention may also consider licenses when replicating production applications. For example, an application license may limit an application to a specific hardware ID of a virtual machine. In this case, the application cannot reside on the same replica virtual machine as another application with the same restriction because there is only one hardware ID for the replica virtual machine. However, there are cases where more than one instance of a hardware ID exists on a replica virtual machine. For example, MAC addresses with multiple network cards and hard disk SMART IDs with multiple virtual disks may allow multiple apps to be supported.
Embodiments of the invention may also consider networking constraints. A virtual machine may have several network cards (NICs) and more than one IP address per NIC. If an application requires a specific IP address, the address can be attached to the replica virtual machine when the application is run. In addition, applications that serve incoming traffic (e.g., web pages) can bind to different IP and port pairs and can work together. If two applications do not work together (e.g., they must bind to a specific port on the first NIC), these applications are not placed on the same replica virtual machine. These constraints are typically considered when replicating production applications to replica virtual machines.
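By way of example only, the OS, hardware, license, and networking constraints described above can be sketched as a pairwise compatibility check. This is a minimal illustration, not the claimed implementation. The `AppProfile` fields and the `can_share` function are hypothetical names introduced here, and the license rule is simplified: it assumes the replica virtual machine exposes a single hardware ID.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppProfile:
    """Hypothetical description of one production application."""
    name: str
    os_family: str                        # e.g. "linux" or "windows"
    cpu_arch: str                         # e.g. "x86_64"
    fixed_ports: frozenset = frozenset()  # (nic_index, port) pairs the app must bind
    hardware_locked_license: bool = False


def can_share(a: AppProfile, b: AppProfile) -> bool:
    """Return True if two applications may share one replica virtual machine."""
    if a.os_family != b.os_family:
        return False  # the apps need the same or a similar OS
    if a.cpu_arch != b.cpu_arch:
        return False  # the apps must support the replica VM hardware
    if a.fixed_ports & b.fixed_ports:
        return False  # both apps must bind the same port on the same NIC
    if a.hardware_locked_license and b.hardware_locked_license:
        # Simplifying assumption: only one hardware ID is available.
        return False
    return True


web = AppProfile("web", "linux", "x86_64", frozenset({(0, 443)}))
proxy = AppProfile("proxy", "linux", "x86_64", frozenset({(0, 443)}))
db = AppProfile("db", "linux", "x86_64", frozenset({(0, 5432)}))
```

In this sketch, `web` and `db` may share a replica virtual machine, while `web` and `proxy` may not because both must bind port 443 on the first NIC.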
By way of example, a replication strategy may generate a topology. More specifically, the topology suggests which production virtual machines are replicated to which replica virtual machines. The topology may describe the manner in which the replica disks are organized. For example, replica disk 1 is used to replicate app1 including data1, replica disk 2 is used to replicate app2, app3, app4 including data2, data3, and data4, etc.
Replication Strategies—Quality of Service
In this example, the production virtual disks 420 include virtual disks 408, 410, 412, and 414. The disk 408 includes an app1 and/or data1. Similarly, the disks 410, 412, and 414 include, respectively, app2/data2, app3/data3, and app4/data4. However, the data and the associated app may be on separate disks. For example, the app may be on the OS disk as previously discussed. In one example, the data is replicated because the binaries of the applications may not need to be replicated and may be stored on the relevant disk.
The QoS strategy, when implemented, causes the replication engine 404 to replicate the disk 408 to the replica disk 416. The replication engine 404 replicates the disks 410, 412, and 414 to the same replica virtual disk 418. Thus, app2, app3, and app4 share the same replica virtual disk 418.
The QoS strategy illustrated in
In other words, to achieve higher performance, at least in terms of recovery, the replica virtual disk 416 is not shared with any other applications and is configured to support a single application. Alternatively, certain replica virtual disks may have a limit on the number of applications that share those replica virtual disks. In one example, the applications may each be associated with a value M. The M value of a first application indicates how many other applications the first application can share a replica virtual machine with. If the app1 has an M value of 1, this rule indicates that the app1 should have a dedicated replica virtual disk, as illustrated in
Using a single replica virtual machine for several applications has cost benefits when production virtual machines are operating properly, but has performance consequences when recovery is required. As a result, more important apps are typically replicated to dedicated virtual machines or to virtual machines that are limited in the number of applications replicated thereto.
The assessment 502 is able to divide the applications being replicated into subsets. Each subset is replicated to a corresponding replica virtual machine. If a subset includes a single application, then that application is the only application replicated to a particular replica virtual machine. If a subset includes 3 applications, those 3 applications are replicated to the same replica virtual machine.
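By way of example only, dividing the applications into subsets according to their M values can be sketched with a greedy first-fit pass. This is a sketch under stated assumptions rather than the assessment 502 itself. It assumes the M value is the maximum number of applications, including the application itself, that may share a replica virtual machine, so that an M value of 1 yields a dedicated replica. The function name `partition_by_qos` is hypothetical.

```python
def partition_by_qos(m_values):
    """Group applications into subsets, one subset per replica virtual machine.

    m_values maps an application name to its M value (assumed here to be the
    largest group size the application tolerates, itself included).
    """
    groups = []  # each entry: {"apps": [...], "limit": smallest M in the group}
    # Place the most restrictive applications first.
    for app, m in sorted(m_values.items(), key=lambda kv: (kv[1], kv[0])):
        if m < 1:
            raise ValueError(f"M value for {app} must be at least 1")
        for group in groups:
            # Joining must respect both the group's limit and the newcomer's M.
            if len(group["apps"]) + 1 <= min(group["limit"], m):
                group["apps"].append(app)
                group["limit"] = min(group["limit"], m)
                break
        else:
            groups.append({"apps": [app], "limit": m})
    return [group["apps"] for group in groups]


subsets = partition_by_qos({"app1": 1, "app2": 3, "app3": 3, "app4": 3})
# subsets == [["app1"], ["app2", "app3", "app4"]]
```

With these illustrative M values, app1 receives a dedicated replica while app2, app3, and app4 share one. First-fit is used only for brevity and does not necessarily minimize the number of replica virtual machines.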
Once the assessment is completed, the data protection system may be configured 504 to replicate the applications in accordance with the strategy. This may include deploying and installing the replica virtual machines, configuring or reconfiguring existing replica virtual machines, or the like.
Next, the production virtual machines (or their applications and data) are replicated to the replica virtual machines. Often, the application may not need to be replicated. Rather, the binary can simply be stored on the replica virtual disk of the virtual machine along with the OS. Typically, replication involves replicating the IOs that occur with respect to an application's data.
Replication Strategies—Failure Domain Aware Strategy
As illustrated in
In this example, the failure domain detector 602 may collect information about the virtual machines 610, 616, and 622. The failure domain detector 602 may collect, by way of example only, virtual machine information 604, datastore information 606, and hypervisor information 608. The collection may be repeated such that the assessment can be reevaluated at times.
By way of example only, the virtual machine information 604 may include IP information for each of the virtual machines. The datastore information 606 may include or identify the physical datastores associated with the virtual machines. The hypervisor information 608 may include or identify the host hypervisors for the virtual machines 610, 616, and 622.
The replication engine 640 may evaluate, assess, or analyze the virtual machine information 604, the datastore information 606, and the hypervisor information 608 to determine how the virtual machines 610, 616, and 622 or the app1, app2 and app3 and associated data1, data2, and data3 should be replicated. The replication engine 640 takes this information into account when generating recommendations on how to perform replication or when replicating.
More specifically, the replication engine 640 may recommend that virtual machines belonging to the same failure domain should be replicated to different replica virtual machines. Alternatively, the replication engine 640 may recommend that virtual machines belonging to different failure domains be replicated to the same replica virtual machine.
A failure domain, by way of example, indicates how a failure may impact the virtual machines and/or applications. More specifically, the replication engine 640 may evaluate the information 604, 606, and 608 to determine which of the production virtual machines belong to the same failure domain. For example, the replication engine 640 may determine, from the information 604, 606, and/or 608, that the virtual disk 612 and the virtual disk 618 (OS disks for virtual machines 610 and 616) are both allocated from the same datastore. As a result, a problem with the datastore X (e.g., device failure) will impact both the virtual machine 610 and the virtual machine 616. In this case, the replication engine 640 may decide to replicate the virtual machines 610 and 616 to different replica virtual machines. Similarly, virtual machines that share the same hypervisor are in the same failure domain and may be replicated to different replica virtual machines. In effect, virtual machines that are impacted by a specific event may be in the same failure domain, whether the event relates to a physical device, a hypervisor issue, an IP or network concern, or the like.
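By way of example only, determining which production virtual machines belong to the same failure domain can be sketched by merging virtual machines that share a datastore or a hypervisor. This is a minimal sketch, not the logic of the replication engine 640. The function name `failure_domains`, the dictionary layout, and the hypervisor names are hypothetical. The sketch also assumes that sharing is transitive: if A shares a datastore with B, and B shares a hypervisor with C, all three are placed in one domain.

```python
from collections import defaultdict


def failure_domains(vms):
    """Group virtual machines that share a datastore or a hypervisor.

    vms maps a VM name to {"datastore": ..., "hypervisor": ...}.
    Returns a sorted list of sorted lists of VM names.
    """
    parent = {vm: vm for vm in vms}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # (resource kind, resource id) -> first VM using it
    for vm, info in vms.items():
        for kind in ("datastore", "hypervisor"):
            key = (kind, info[kind])
            if key in first_seen:
                # Merge this VM into the domain of the earlier VM.
                parent[find(vm)] = find(first_seen[key])
            else:
                first_seen[key] = vm

    domains = defaultdict(list)
    for vm in vms:
        domains[find(vm)].append(vm)
    return sorted(sorted(members) for members in domains.values())


production = {
    "vm610": {"datastore": "X", "hypervisor": "hv-a"},  # hypothetical values
    "vm616": {"datastore": "X", "hypervisor": "hv-b"},
    "vm622": {"datastore": "Y", "hypervisor": "hv-c"},
}
domains = failure_domains(production)
# domains == [["vm610", "vm616"], ["vm622"]]
```

Here the virtual machines 610 and 616 fall into one failure domain because both are allocated from datastore X.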
More specifically, if M production virtual machines are in the same failure domain (e.g., deployed on the same hypervisor system, share the same datastore), these virtual machines are likely to experience problems or disasters at the same time. For example, when a datastore has a hardware problem or a connectivity issue or a hypervisor fails and is unavailable, all of the production virtual machines or applications are affected and may need to be failed over. If these production virtual machines have been replicated to the same replica virtual machine, then the performance of the failover or recovery operation is adversely affected by the need to recover or failover multiple applications at the same time on the same replica virtual machine.
By accounting for failure domains when replicating, the performance of the recovery or failover operations is improved. More specifically, if these virtual machines in the same failure domain are replicated to different replica virtual machines, the recovery or failover performance improves.
More generally, the failure domain strategy identifies virtual machines, applications and/or application data that would be affected should a problem occur with regard to hardware, software, hypervisors, networking components or characteristics or the like. For example, virtual disks on the same datastore would likely all be impacted by an issue with the datastore. Applications or operating systems on the same hypervisor would all be impacted by an issue with the hypervisor.
The assessment 702 identifies subsets or groups of applications/virtual machines/application data based on these types of failure domains. Generally, this assessment ensures that the applications/virtual machines/application data in the same failure domain are not assigned or replicated to the same replica virtual machine. Similarly, applications/virtual machines/application data that are not part of the same failure domain may be allowed to share the same replica virtual machine.
This strategy may improve the performance of any failover or recovery operation. If applications from the same failure domain are on the same replica virtual machine, the recovery is likely slowed because multiple applications are being failed over or recovered on the same replica virtual machines. If the applications on a replica virtual machine are associated with different failure domains, then failover or recovery may not involve multiple applications on the same replica virtual machine.
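By way of example only, the placement rule described above (members of one failure domain are never replicated to the same replica virtual machine) can be sketched as follows. The function name `place_across_replicas` is hypothetical, and the simple placement shown here is one possibility among many. It assumes each virtual machine appears in exactly one failure domain.

```python
def place_across_replicas(domains):
    """Assign production VMs to replica VMs, one member per domain per replica.

    domains is a list of failure domains, each a list of VM names.
    Returns a list of replica VMs, each a list of the VMs replicated to it.
    """
    replicas = []
    for domain in domains:
        for index, vm in enumerate(domain):
            # The i-th member of every domain goes to the i-th replica, so two
            # members of the same domain can never share a replica.
            if index == len(replicas):
                replicas.append([])
            replicas[index].append(vm)
    return replicas


placement = place_across_replicas([["vm610", "vm616"], ["vm622"]])
# placement == [["vm610", "vm622"], ["vm616"]]
```

The number of replica virtual machines equals the size of the largest failure domain. The sketch does not balance load, so the first replica tends to receive the most applications. A practical system would likely combine this rule with the QoS and compatibility constraints.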
Next, the replica virtual machines are configured 704 based on the strategy. Thus, each replica virtual machine may be prepared or reconfigured, if necessary, to replicate a subset of the production applications. Next, the production virtual machines (or application data) are replicated 708 based on the strategy from the production virtual machines to the replica virtual machines.
Replication Strategies—Related Application Strategy
By way of example, the applications of a production site may be related to or depend on other applications. Further, the applications of a production site may also be distributed. Because these applications are related or interdependent, it is often useful to failover all of the related applications, even when only one of the applications fails or has an issue.
However, failing over only the virtual machine 840 (or the app6) may create a latency between the applications running on the production site and the application running on the replica site. This latency may be unacceptable for various reasons. As a result, it may be desirable to failover all of the related applications.
As illustrated in
In this example, if any of the app1, app3, and app6 require failover, all of the related apps are failed over. This improves performance at least because the related applications are all operating at the failover or replica site. This allows the production site to become the replica site as well in one example. This allows the related applications to failback if necessary or when needed.
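By way of example only, identifying every application that should fail over together with a failing application can be sketched as a traversal of the relationships between applications. The function name `related_group` and the specific relationship pairs below are hypothetical. The text above states only that app1, app3, and app6 are related.

```python
from collections import defaultdict, deque


def related_group(relations, failed_app):
    """Return the sorted list of applications to fail over with failed_app.

    relations is an iterable of (app, app) pairs marking related applications.
    Relatedness is treated as symmetric and transitive.
    """
    graph = defaultdict(set)
    for a, b in relations:
        graph[a].add(b)
        graph[b].add(a)

    seen = {failed_app}
    queue = deque([failed_app])
    while queue:  # breadth-first search over related applications
        current = queue.popleft()
        for neighbor in graph[current]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return sorted(seen)


relations = [("app1", "app3"), ("app3", "app6")]  # illustrative pairs only
to_fail_over = related_group(relations, "app6")
# to_fail_over == ["app1", "app3", "app6"]
```

An application with no recorded relationships yields a group containing only itself, so it fails over alone.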
Thus, as illustrated in
Similarly, the data2 is replicated to the virtual disk 856 of the virtual machine 850 and the virtual disk 852 may include the OS as well as binaries of app1 and app2. Also, the data5 may be replicated to the virtual disk 874 and the virtual disk 872 of the virtual machine 870 may be an OS disk and store binaries of app5 and app6.
In some examples of the replication strategies, the applications on the OS disks may be operational. This shortens the RTO and allows the data disks to simply be mounted during failover or disaster recovery or for another event.
If the related apps are all replicated to the same virtual machine, the performance of the overall application may suffer due to limited resources. Further, failing all of the related applications over to the same replica virtual machine may also impact the performance of the failover operation because a single virtual machine is performing failover for multiple related applications at the same time.
Dynamic Strategy
Embodiments of the invention further relate to strategies including replication strategies that relate to fixed data protection topologies and/or changing data protection topologies. The replication strategies discussed herein can be static and/or dynamic. In addition, embodiments of the invention also allow these strategies to adapt to changing environments. More specifically, embodiments of the invention include or relate to data protection systems that adapt to changes in the computing environment or in the data protection environment or to changes in the machines or systems being protected. The data protection system, by way of example, is configured to dynamically and/or automatically change or adapt the replication performed by the data protection system. In the context of replicating virtual machines, changing or adapting to the topology of the production site/production virtual machines includes making changes to the relationships between the production virtual machines and the replica virtual machines. By way of example, a change in topology refers to changes in the manner in which applications are replicated. Changes in the topology may include, by way of example and not limitation, the addition or removal of an application, the migration of an application to a different server, changes in storage, hypervisors, domains, user input, user designated priorities, or the like or combination thereof.
Dynamic Strategy—Adapting to Changing Environment
In one example, dynamically and/or automatically adapting to a changing environment using, for example, a QoS Strategy, a failure domain aware strategy and/or a related application strategy, includes changing the manner in which the applications are distributed or replicated to the replica virtual machines. This may include reallocating virtual machines, adding new replica virtual machines, migrating virtual machines to different servers or hosts, changing which applications are replicated to which replica virtual machines, or the like or combination thereof.
For example, an entity or user may escalate an application A to a higher priority from a lower priority (higher importance from lower importance). In one example and with reference to
For example and with reference to
This allows the data protection system 402 to dynamically change the topology of the replica virtual disks or replica virtual machines. In this example, the change in the importance or priority of an application was not necessarily reflected in a change in the topology of the production virtual disks 420 or the replica virtual disks 422. However, the data protection system 402 changed the topology, arrangement, or distribution of the replica virtual disks 422 in response to the change in the priority, which was reflected by way of example only as a change (e.g., deleted rule, modified rule, new rule) in the rules 406.
When the strategy includes a failure domain aware strategy and with reference to
For example, load balancing requirements may result in a production virtual machine being migrated from one physical server to another physical server. This type of change in the topology of the production virtual machines may change the assessment performed with respect to failure domains. Migrating a virtual machine from one physical server to another physical server may change, for example, the hypervisor associated with the migrated virtual machine. As previously described, the failure domain aware strategy accounts for the hypervisors on which the virtual machines operate. Consequently, the assessment that is used to replicate virtual machines may be impacted by the migration of the virtual machine. The replication engine 640 may, in this example, periodically reassess the replication deployment and configuration or reassess in response to a detected change or in response to a notification regarding the change. Alternatively, the replication engine 640 may be configured to detect and respond dynamically and automatically to changes that impact the failure domain assessment.
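By way of example only, detecting a change that warrants a new failure domain assessment can be sketched by comparing two snapshots of the collected virtual machine, datastore, and hypervisor information. The function names `detect_changes` and `needs_reassessment` and the snapshot layout are hypothetical and are not drawn from the replication engine 640.

```python
def detect_changes(previous, current):
    """Compare two snapshots mapping VM name -> placement information."""
    prev_names, curr_names = set(previous), set(current)
    return {
        "added": sorted(curr_names - prev_names),
        "removed": sorted(prev_names - curr_names),
        # A VM is "moved" when its hypervisor or datastore differs.
        "moved": sorted(
            vm for vm in prev_names & curr_names if previous[vm] != current[vm]
        ),
    }


def needs_reassessment(previous, current):
    """Return True if any change could affect the failure domain assessment."""
    return any(detect_changes(previous, current).values())


before = {
    "vm610": {"datastore": "X", "hypervisor": "hv-a"},  # hypothetical values
    "vm616": {"datastore": "X", "hypervisor": "hv-b"},
}
after = {
    "vm610": {"datastore": "X", "hypervisor": "hv-b"},  # migrated to hv-b
    "vm616": {"datastore": "X", "hypervisor": "hv-b"},
}
changes = detect_changes(before, after)
# changes == {"added": [], "removed": [], "moved": ["vm610"]}
```

Such a comparison could run periodically or in response to a notification. The migration of vm610 in this illustration changes its hypervisor and therefore triggers a reassessment.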
The replication engine of the data protection system may similarly be configured to adapt to changes when the replication strategy focuses on related application. Changes in the topology of the production virtual machines may require a change in the replication of the virtual machines. The data protection system can monitor for these changes or receive input from a user related to the application topology. In response, the data protection system can move the replicated applications and/or their virtual disks between replica virtual machines, to new virtual machines, or the like as needed.
In another example, the data protection system may monitor the replica virtual machines. When one or more of the replica virtual machines become loaded with failed over applications, embodiments of the invention may scale out and dynamically add additional virtual machines. The applications can be redistributed using the added replica virtual machines. A similar assessment as previously discussed may be performed to determine how the applications are replicated in this example.
Dynamic Strategy—Failure Prediction
Including failure prediction to manage replication allows the data protection system to operate with fewer replica machines when no failure is anticipated and to run more replica virtual machines when failure is predicted. Embodiments of the invention may incorporate artificial intelligence or machine learning to predict failures or other circumstances that may benefit from changing the manner in which applications are replicated. Predicting failures gives the data protection system advance warning of a potential problem and the opportunity to respond to the potential issue. In some examples, the problem indicated by machine learning may be avoided by taking action based on machine learning or other statistics-based models. If adding more virtual machines and replicating some of the applications to the added virtual machines results in a prediction that the replication will not fail, this type of action can be taken. In one embodiment, the model may not only predict the failure but may also predict a potential solution to the failure or to prevent the anticipated failure. The results of the machine learning can be used by the data protection engine to manage the deployment of the virtual machines (e.g., add, remove), move applications and/or their data to different virtual machines, or the like or combination thereof.
In one example, the data protection system (or a separate prediction engine that interfaces with the data protection system) may be configured to collect system runtime statistics as training features for the model. The system runtime statistics may include, but are not limited to, disk IO latency and change rate, network IO latency and change rate, CPU/memory utilization, IO error numbers and change rate, and log file size increase rate. This data and any time series generated from this data can be used to train models such as fail-stop models or fail-slow models before the disaster or other issue occurs.
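By way of example only, deriving a change rate feature from sampled runtime statistics can be sketched as follows. The function name `change_rates` and the sample values are hypothetical. The sketch assumes each statistic is sampled as a (timestamp in seconds, value) pair.

```python
def change_rates(samples):
    """Return the per-second change rate between consecutive samples.

    samples is a list of (timestamp_seconds, value) pairs in time order.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if t1 <= t0:
            raise ValueError("timestamps must be strictly increasing")
        rates.append((v1 - v0) / (t1 - t0))
    return rates


# Illustrative disk IO latency samples in milliseconds, one per minute.
latency_ms = [(0, 10.0), (60, 13.0), (120, 19.0)]
latency_rate = change_rates(latency_ms)  # two rates: 3/60 and 6/60 ms per second
```

The resulting series, together with the raw values, could serve as training features for a fail-stop or fail-slow model. The choice of model and training procedure is outside the scope of this sketch.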
In this example, the replication engine 1004 may include a prediction engine 1006 that is configured to predict failure or problems at the replica site or with respect to the virtual machines and/or applications at the replica site. However, the replication engine 1004 may also include a model that accounts for characteristics at the production site, such as an increased workload, increased resource consumption, workload schedules, or the like.
The replication engine 1004 (or the prediction engine 1006) may collect or receive statistics or characteristics (generally referred to herein as statistics) from the replica site and/or the production site that may include operations statistics. The prediction engine 1006 may use the statistics 1012 and/or time series 1014, which may be generated from the statistics that are sampled or collected over time. These statistics 1012 and/or time series 1014 may be used to train a model to predict failure of the replica site, the production site, or of the replication occurring therebetween.
The prediction engine 1006 may also support triggers 1008. The triggers 1008 may include information about the weather or other threats (e.g., a hurricane notification) that can be used to adjust the topology of the replication virtual machines or the specifics of the replication strategy (which applications are replicated to which virtual machines).
In one example, the model of the prediction engine 1006 may output a value based on the input (e.g., the current statistics or current time series or triggers). If the value output by the prediction engine 1006 is above a threshold value, which indicates a sufficiently high risk of threat, the replication engine 1004 may reconfigure or adjust the topology of the replica virtual machines 1010. This allows the replication engine 1004 to spread applications impacted by the potential threat on more replica virtual machines. If the value subsequently lowers below the threshold value, the replication engine 1004 may consolidate applications to fewer virtual machines.
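By way of example only, the threshold behavior described above can be sketched as a function that spreads applications over more replica virtual machines when the predicted risk exceeds the threshold and consolidates them otherwise. The function name `plan_topology`, its parameters, and the numeric values are hypothetical. The even, contiguous distribution of applications is an assumption made for brevity.

```python
def plan_topology(apps, risk, threshold, baseline_vms, scaled_out_vms):
    """Return a list of replica VMs, each a list of the apps replicated to it."""
    if not apps:
        return []
    # Spread out on high predicted risk, consolidate otherwise.
    vm_count = scaled_out_vms if risk > threshold else baseline_vms
    vm_count = max(1, min(vm_count, len(apps)))  # never more VMs than apps
    topology = [[] for _ in range(vm_count)]
    for i, app in enumerate(apps):
        # Contiguous, near-even split of the ordered application list.
        topology[i * vm_count // len(apps)].append(app)
    return topology


apps = ["app1", "app2", "app3", "app4"]
calm = plan_topology(apps, risk=0.2, threshold=0.7, baseline_vms=2, scaled_out_vms=4)
# calm == [["app1", "app2"], ["app3", "app4"]]
alert = plan_topology(apps, risk=0.9, threshold=0.7, baseline_vms=2, scaled_out_vms=4)
# alert == [["app1"], ["app2"], ["app3"], ["app4"]]
```

A practical system might add hysteresis (separate scale-out and consolidation thresholds) to avoid repeated reconfiguration when the predicted value hovers near the threshold. That refinement is not described in the text above and is mentioned only as a design consideration.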
In another example, the prediction engine 1006 may identify which virtual machine is expected to fail first and may identify potential remedies.
Because the model is based on machine learning, the model can identify specific characteristics (e.g., IO latency) that, when slow, lead to a problem with specific components or virtual machines or with the replication system as a whole. Further, the model of the prediction engine 1006 can also identify relationships, including non-linear relationships among the statistics of the replica virtual machines 1010 that may lead to a problem. Examples of failure prediction models are implemented in, by way of example, Dell EMC CloudIQ, Dell EMC eCDA, VMware Logs Insight and more.
The model or machine learning algorithm can account for the statistics discussed herein, and model time series using, by way of example, ARIMA and Holt-Winters algorithms. By way of example, these algorithms can receive input such as seasonality (e.g., expected load times) or they can be used to find the best seasonality. Seasonality can be hourly, daily, or weekly, for instance. The best model may also find the most appropriate seasonality. This allows the topology to be changed based on the seasonality or the anticipated seasonality.
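By way of illustration only, finding the most appropriate seasonality may be sketched as a search over candidate periods. The sketch below deliberately uses a simple seasonal-naive forecast as a stand-in and is not ARIMA or Holt-Winters; the function names, the candidate periods, and the sample data are assumptions introduced for the example.

```python
from typing import Sequence


def seasonal_naive_error(series: Sequence[float], period: int) -> float:
    """Mean absolute error of predicting series[t] with series[t - period]."""
    if period <= 0 or period >= len(series):
        raise ValueError("period must be between 1 and len(series) - 1")
    errors = [abs(series[t] - series[t - period]) for t in range(period, len(series))]
    return sum(errors) / len(errors)


def best_seasonality(series: Sequence[float], candidates: Sequence[int]) -> int:
    """Return the candidate period with the lowest seasonal-naive error."""
    return min(candidates, key=lambda p: seasonal_naive_error(series, p))


# Hypothetical load samples whose pattern repeats every 4 samples.
load = [10, 50, 90, 50] * 6
period = best_seasonality(load, candidates=[2, 3, 4, 6])
```

For the hypothetical data above, the period of 4 samples yields zero error and is therefore selected. A production model would typically also weigh trend and noise, which this sketch ignores.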
When implementing strategies to improve or reduce RTO dynamically, by way of example only, embodiments of the invention (e.g., the replication engine or, more generally, the data protection system) are configured to move, add, remove, and/or rearrange applications on the replica virtual machines (and/or the production virtual machines) based on the new or updated assessment that is performed in accordance with the current replication strategy.
Before the change in topology, the replica virtual machine 1126 includes an OS disk 1128 that also includes app1 and app2 and the data disk 1130 for data1 (data of app1) and the data disk 1132 for data2 (the data of app2). Similarly, the replica virtual machine 1134 includes OS disk 1136 with app3 and app4 and the data disk 1138 for data3 (data of app3) and data disk 1139 for data4 (data of app4).
In this example, virtual machines 1126 and 1140 are the same replica virtual machine at different points in time (before/after the change in topology). Virtual machines 1134 and 1146 are also the same replica virtual machine at the corresponding points in time. The virtual disks have a similar correspondence.
Alternatively, the change in the topology may be in response to a detected change with respect to any of the strategies discussed herein. For example, a change in the quality of service may cause the data protection engine to have the app2 and its data on a dedicated virtual machine. Similarly, changes in the failure domain, changes in the related applications, exceeding a threshold predictive value, or the like may result in a topology change.
In order to move app1 from the virtual machine 1140 to the virtual machine 1146, embodiments of the invention may use solutions such as Chef or Puppet. Further, the IP addresses and credentials may be used in order to configure the installation of app1 on the virtual machine 1146. After the app1 is successfully installed on the disk 1148, the app1 may be uninstalled from the virtual machine 1140. Uninstalling the app1 may be optional, but aids in clearing disk space. After successfully installing the app1 on the disk 1148, the data protection system may move the data disk 1130 from the virtual machine 1140 to the virtual machine 1146. This should not impact the OS disk 1142 at least because the data disks on the replica virtual machines are not in use as replicas.
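By way of illustration only, the sequence of moving a replicated application and its data disk may be sketched as follows. The `ReplicaVM` class and the `move_app` function are assumptions introduced for the example; they model the bookkeeping only and do not invoke Chef, Puppet, or any hypervisor interface.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicaVM:
    name: str
    apps: set = field(default_factory=set)          # applications installed on the OS disk
    data_disks: dict = field(default_factory=dict)  # application name -> attached data disk


def move_app(app: str, source: ReplicaVM, target: ReplicaVM, uninstall: bool = True) -> None:
    """Move a replicated application and its data disk from source to target."""
    if app not in source.apps:
        raise ValueError(f"{app} is not installed on {source.name}")
    # Step 1: install the application on the target (a configuration tool would do this).
    target.apps.add(app)
    # Step 2: detach the data disk from the source and attach it to the target.
    if app in source.data_disks:
        target.data_disks[app] = source.data_disks.pop(app)
    # Step 3 (optional): uninstall from the source to free space on its OS disk.
    if uninstall:
        source.apps.discard(app)


# Hypothetical state mirroring the example: app1 moves from VM 1140 to VM 1146.
vm_1140 = ReplicaVM("vm1140", {"app1", "app2"}, {"app1": "disk1130", "app2": "disk1132"})
vm_1146 = ReplicaVM("vm1146", {"app3", "app4"}, {"app3": "disk1138", "app4": "disk1139"})
move_app("app1", vm_1140, vm_1146)
```

After the call, the data disk 1130 is associated with the virtual machine 1146 and app1 no longer occupies space on the virtual machine 1140.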
The virtual machine 1210, prior to the change, included the OS disk 1212, which also includes app3 and app4. The virtual machine 1210 also included data disks 1214 for data3 and 1216 for data4. After the change, the portable disk 1206 was added. After the change in topology, the disk 1206 is no longer associated with the virtual machine 1202 as illustrated in
In one example, a data protection system may be replicating applications. For example, four applications may replicate to the same replica virtual machine. In one example, two of these applications may fail over to the replica virtual machine. In this example, if another application failed over, the resources of the replica virtual machine may not be able to handle the resource demand and, as a result, performance will suffer.
This type of situation may be detected by the machine learning model. However, the replication strategy may scale out by adding a new virtual machine. The new replica virtual machine may be for the applications that have not failed over. As previously described, migrating a replica application is distinct and potentially easier than migrating a running application. This keeps the RTO very low and provides high performance.
Scaling out the replica virtual machine may be achieved by cloning the existing replica virtual machine. More specifically, in one example, only the OS disk is cloned and the data disks are not. Next, the applications that have not failed over are moved to the new virtual machine. In one example, the cloning process may use snapshots.
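By way of illustration only, scaling out by cloning may be sketched as follows. The `Replica` class and `scale_out_by_clone` function are assumptions introduced for the example. The sketch also assumes that, after cloning, each virtual machine keeps only the applications it is responsible for, which is one possible cleanup policy.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    os_apps: set = field(default_factory=set)       # applications installed on the OS disk
    data_disks: dict = field(default_factory=dict)  # application name -> data disk
    failed_over: set = field(default_factory=set)   # applications now running on this VM


def scale_out_by_clone(src: Replica, new_name: str) -> Replica:
    """Clone only the OS disk, then move applications that have not failed over."""
    # Cloning the OS disk copies the installed applications but no data disks.
    clone = Replica(new_name, os_apps=set(src.os_apps))
    standby = src.os_apps - src.failed_over
    for app in standby:
        # Reattach each standby application's data disk to the clone.
        if app in src.data_disks:
            clone.data_disks[app] = src.data_disks.pop(app)
    # Assumed cleanup: each VM keeps only the applications it now serves.
    clone.os_apps = set(standby)
    src.os_apps = src.os_apps & src.failed_over
    return clone
```

Because only replica applications are moved, the failed-over applications on the source virtual machine are not interrupted in this sketch.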
In another example, the scaling out process may be performed using a template virtual machine, which can be spawned on demand. When using a template virtual machine, the OS disk that contains the applications does not need to be cloned and there is no need to uninstall any applications in the new virtual disk. The template virtual machine may be created as part of an original configuration. The template virtual machine may be used to create a replica virtual machine. Next, a desired OS is installed.
After installing the OS, the procedure can follow the processes previously described. Non-portable applications, for example, can be installed on the replica virtual machine and replication can be performed. If the replica virtual machine becomes heavy (e.g., too many failed over applications, too many applications replicating to the same virtual machine), a second replica machine may be created from the template virtual machine. For non-portable applications, the applications that have not yet failed over are installed on the second replica machine. The disks from the initial replica virtual machine that belong to applications that have not failed over are detached from the initial replica virtual machine and then attached to the second replica virtual machine. The non-portable applications are uninstalled from the initial virtual machine. Finally, replication continues for the applications moved to the second replica virtual machine.
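By way of illustration only, the template-based procedure may be sketched as follows. The names `ReplicaMachine`, `spawn_from_template`, and `relieve_if_heavy`, as well as the limits `max_failed_over` and `max_apps`, are assumptions introduced for the example; the disclosure does not specify how heaviness is measured.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ReplicaMachine:
    name: str
    installed: set = field(default_factory=set)    # non-portable applications on this VM
    disks: dict = field(default_factory=dict)      # application name -> attached data disk
    failed_over: set = field(default_factory=set)  # applications that failed over to this VM


def spawn_from_template(name: str) -> ReplicaMachine:
    """Hypothetical stand-in for instantiating the template VM and installing an OS."""
    return ReplicaMachine(name)


def relieve_if_heavy(initial: ReplicaMachine, new_name: str,
                     max_failed_over: int = 1, max_apps: int = 3) -> Optional[ReplicaMachine]:
    """Create a second replica when the initial one is heavy; otherwise return None."""
    heavy = len(initial.failed_over) > max_failed_over or len(initial.installed) > max_apps
    if not heavy:
        return None
    second = spawn_from_template(new_name)
    for app in sorted(initial.installed - initial.failed_over):
        second.installed.add(app)                       # install on the second replica
        if app in initial.disks:
            second.disks[app] = initial.disks.pop(app)  # detach, then attach
        initial.installed.discard(app)                  # uninstall from the initial replica
    return second
```

Unlike the cloning approach, the second replica starts without any applications, so nothing needs to be uninstalled from it.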
Embodiments of the invention thus relate to a cost-optimized true zero RTO solution based on dynamic replication strategies. Embodiments of the invention relate to dynamic replica virtual machine deployment optimization that is adaptive to environment changes. Embodiments of the invention can adaptively change the replication topology using static and/or dynamic strategies. Embodiments of the invention can adapt the replication strategy using machine learning and artificial intelligence. Heavily loaded replica virtual machines can be scaled out to multiple replica virtual machines using, for example, cloning or virtual machine templates. Embodiments of the invention further allow multiple applications to fail over without the risk of running out of resources (e.g., by creating new replica virtual machines). In addition, embodiments of the invention can also scale down the number of replica virtual machines based on the strategies discussed herein.
Assessing 1302 may also include one or more other elements in different combinations. For example, the assessment may be based on a replication strategy that uses a machine model. The machine model may use inputs that include characteristics of the virtual machines, applications, and/or the like. The inputs may be from the production virtual machines/applications and/or the replica virtual machines/applications. The machine model may output a value and when the value exceeds a threshold value, the topology of the replica virtual machines is changed.
The characteristics can be sampled over time to generate sample data. The sample data may be used to train the machine model. The machine model can be updated or retrained using more current inputs.
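By way of illustration only, sampling a characteristic over time and retraining on more current inputs may be sketched as follows. The `BaselineModel` class is an assumption introduced for the example and is a deliberately simple stand-in (a sliding-window mean and standard deviation), not the machine model of the disclosure.

```python
from collections import deque
from statistics import mean, pstdev


class BaselineModel:
    """Stand-in model: scores how far a sample is from recently observed history."""

    def __init__(self, window: int = 100):
        self.samples = deque(maxlen=window)  # sliding window; oldest samples drop out
        self.mu = 0.0
        self.sigma = 1.0

    def add_sample(self, value: float) -> None:
        """Record one sampled characteristic, e.g., a disk IO latency reading."""
        self.samples.append(value)

    def retrain(self) -> None:
        """Refit the baseline on the samples currently in the window."""
        if len(self.samples) >= 2:
            self.mu = mean(self.samples)
            self.sigma = pstdev(self.samples) or 1.0  # avoid dividing by zero

    def score(self, value: float) -> float:
        """Return a z-score style value; larger means more unusual."""
        return abs(value - self.mu) / self.sigma
```

The returned score could then be compared against a threshold value to decide whether the topology of the replica virtual machines should change.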
The output of the machine model may also reflect or predict a slowing or a failure of the replication at various locations including at the replica site. The machine model may also be able to identify potential solutions in addition to simply predicting a slowdown or failure. The potential solutions may be reflected in the changes made to the topology of the replica virtual machines at 1306.
Assessing the applications may also include adapting to changes in the environment and adapting replication strategies that may be used in a static manner. For example, changes in the importance of an application, changes in failure domains, or changes in related applications are changes that may be used to change 1306 the topology of the replica virtual machines as previously described. These changes can be detected separately from the machine model or the machine model may be adapted to account for these changes and use additional inputs such as machine migration or other factors or characteristics used to identify related applications, quality of service, or failure domains as inputs. Assessing the applications may also include monitoring or receiving triggers related to incoming danger, such as natural disasters and the like that may impact the data protection system and the protected systems.
Changing 1306 the topology may include one or more of moving an application, whether using a tool or by moving a portable application. Changing 1306 the topology may also include scaling the replica virtual machines by cloning (e.g., cloning only the operating disk and then moving the applications) or using a virtual machine template and then moving the applications.
Embodiments of the invention, such as the examples disclosed herein, may be beneficial in a variety of respects. For example, and as will be apparent from the present disclosure, one or more embodiments of the invention may provide one or more advantageous and unexpected effects, in any combination, some examples of which are set forth below. It should be noted that such effects are neither intended, nor should be construed, to limit the scope of the claimed invention in any way. It should further be noted that nothing herein should be construed as constituting an essential or indispensable element of any invention or embodiment. Rather, various aspects of the disclosed embodiments may be combined in a variety of ways so as to define yet further embodiments. Such further embodiments are considered as being within the scope of this disclosure. As well, none of the embodiments embraced within the scope of this disclosure should be construed as resolving, or being limited to the resolution of, any particular problem(s). Nor should any such embodiments be construed to implement, or be limited to implementation of, any particular technical effect(s) or solution(s). Finally, it is not required that any embodiment implement any of the advantageous and unexpected effects disclosed herein.
The following is a discussion of aspects of example operating environments for various embodiments of the invention. This discussion is not intended to limit the scope of the invention, or the applicability of the embodiments, in any way.
In general, embodiments of the invention may be implemented in connection with systems, software, and components, that individually and/or collectively implement, and/or cause the implementation of, data protection operations. The scope of the invention embraces any operating environment in which the disclosed concepts may be useful.
At least some embodiments of the invention provide for the implementation of the disclosed functionality in existing data protection platforms, examples of which include RecoverPoint for VMs. In general however, the scope of the invention is not limited to any particular data backup platform or data storage environment.
New and/or modified data collected and/or generated in connection with some embodiments, may be stored in a data protection environment that may take the form of a public or private cloud storage environment, an on-premises storage environment, and hybrid storage environments that include public and private elements. Any of these example storage environments, may be partly, or completely, virtualized. The storage environment may comprise, or consist of, a datacenter which is operable to service read, write, delete, backup, restore, and/or cloning, operations initiated by one or more clients or other elements of the operating environment. Where a backup comprises groups of data with different respective characteristics, that data may be allocated, and stored, to different respective targets in the storage environment, where the targets each correspond to a data group having one or more particular characteristics.
Example public cloud storage environments in connection with which embodiments of the invention may be employed include, but are not limited to, Microsoft Azure, Amazon AWS, and Google Cloud. More generally however, the scope of the invention is not limited to employment of any particular type or implementation of cloud storage.
In addition to the storage environment, the operating environment may also include one or more clients that are capable of collecting, modifying, and creating, data. As such, a particular client may employ, or otherwise be associated with, one or more instances of each of one or more applications that perform such operations with respect to data.
Devices in the operating environment may take the form of software, physical machines, or virtual machines (VM), or any combination of these, though no particular device implementation or configuration is required for any embodiment. Similarly, data protection system components such as databases, storage servers, storage volumes (LUNs), storage disks, replication services, backup servers, restore servers, backup clients, and restore clients, for example, may likewise take the form of software, physical machines or virtual machines (VM), though no particular component implementation is required for any embodiment. Where VMs are employed, a hypervisor or other virtual machine monitor (VMM) may be employed to create and control the VMs. The term VM embraces, but is not limited to, any virtualization, emulation, or other representation, of one or more computing system elements, such as computing system hardware. A VM may be based on one or more computer architectures, and provides the functionality of a physical computer. A VM implementation may comprise, or at least involve the use of, hardware and/or software. An image of a VM may take various forms, such as a .VMDK file for example.
As used herein, the term ‘data’ is intended to be broad in scope. Thus, that term embraces, by way of example and not limitation, data segments such as may be produced by data stream segmentation processes, data chunks, data blocks, atomic data, emails, objects of any type, files of any type including media files, word processing files, spreadsheet files, and database files, as well as contacts, directories, sub-directories, volumes, and any group of one or more of the foregoing.
Example embodiments of the invention are applicable to any system capable of storing and handling various types of objects, in analog, digital, or other form. Although terms such as document, file, segment, block, or object may be used by way of example, the principles of the disclosure are not limited to any particular form of representing and storing data or other information. Rather, such principles are equally applicable to any object capable of representing information.
As used herein, the term ‘backup’ is intended to be broad in scope. As such, example backups in connection with which embodiments of the invention may be employed include, but are not limited to, full backups, partial backups, clones, snapshots, and incremental or differential backups.
Following are some further example embodiments of the invention. These are presented only by way of example and are not intended to limit the scope of the invention in any way.
Embodiment 1. A method for replicating data from a production site to a replica site, the method comprising repeatedly assessing, by a data protection system, applications protected by the data protection system based on a replication strategy using a machine model configured to generate a prediction regarding failure of the applications, wherein the replication strategy is configured to predict the failure based on inputs that include characteristics of the virtual machines, replicating the applications from production virtual machines at the production site to the replica virtual machines according to the replication strategy, and changing a topology of the replica virtual machines based on the prediction of the machine model.
Embodiment 2. A method according to embodiment 1, further comprising performing sampling to collect sample data corresponding to the characteristics of the virtual machines.
Embodiment 3. A method according to embodiment 1 and/or 2, wherein the characteristics of the virtual machines include one or more of disk IO latency, disk change rate, network IO latency, network change rate, processor utilization, memory utilization, IO error number, and/or IO error rate, wherein the machine model generates a value, wherein the topology is changed when the value exceeds a threshold value.
Embodiment 4. A method according to embodiment 1, 2 and/or 3, wherein the sample data includes data points and time series data.
Embodiment 5. A method according to embodiment 1, 2, 3, and/or 4, further comprising training the machine model with the sample data.
Embodiment 6. A method according to embodiment 1, 2, 3, 4, and/or 5, wherein the inputs further include triggers regarding an incoming danger.
Embodiment 7. A method according to embodiment 1, 2, 3, 4, 5, and/or 6, further comprising monitoring for changes in the applications at the production site and adapting to the changes in the production site, the changes including quality of service changes, failure domain changes, and/or related application changes.
Embodiment 8. A method according to embodiment 1, 2, 3, 4, 5, 6, and/or 7, further comprising changing the topology by moving applications, moving portable applications, and/or scaling the replica virtual machines.
Embodiment 9. A method according to embodiment 1, 2, 3, 4, 5, 6, 7, and/or 8, further comprising scaling the replica virtual machines by cloning a replica virtual machine, wherein only an operating disk is cloned to generate a new replica virtual machine and wherein applications are then moved to the new replica virtual machine.
Embodiment 10. A method according to embodiment 1, 2, 3, 4, 5, 6, 7, 8, and/or 9, further comprising creating a replica virtual machine template, wherein scaling the replica virtual machines includes instantiating a new replica virtual machine based on the replica virtual machine template.
Embodiment 11. A method for performing any of the operations, methods, or processes, or any portion of any of these, disclosed herein in any of the embodiments disclosed herein in the specification and/or in the Figures.
Embodiment 12. A non-transitory storage medium having stored therein instructions that are executable by one or more hardware processors to perform the operations of any one or more of or portions of embodiments 1-11.
The embodiments disclosed herein may include the use of a special purpose or general-purpose computer including various computer hardware or software modules, as discussed in greater detail below. A computer may include a processor and computer storage media carrying instructions that, when executed by the processor and/or caused to be executed by the processor, perform any one or more of the methods disclosed herein, or any part(s) of any method disclosed.
As indicated above, embodiments within the scope of the present invention also include computer storage media, which are physical media for carrying or having computer-executable instructions or data structures stored thereon. Such computer storage media may be any available physical media that may be accessed by a general purpose or special purpose computer.
By way of example, and not limitation, such computer storage media may comprise hardware storage such as solid state disk/device (SSD), RAM, ROM, EEPROM, CD-ROM, flash memory, phase-change memory (“PCM”), or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other hardware storage devices which may be used to store program code in the form of computer-executable instructions or data structures, which may be accessed and executed by a general-purpose or special-purpose computer system to implement the disclosed functionality of the invention. Combinations of the above should also be included within the scope of computer storage media. Such media are also examples of non-transitory storage media, and non-transitory storage media also embraces cloud-based storage systems and structures, although the scope of the invention is not limited to these examples of non-transitory storage media.
Computer-executable instructions comprise, for example, instructions and data which, when executed, cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions. As such, some embodiments of the invention may be downloadable to one or more systems or devices, for example, from a website, mesh topology, or other source. As well, the scope of the invention embraces any hardware system or device that comprises an instance of an application that comprises the disclosed executable instructions.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts disclosed herein are disclosed as example forms of implementing the claims.
As used herein, the term ‘module’ or ‘component’ may refer to software objects or routines that execute on the computing system. The different components, modules, engines, and services described herein may be implemented as objects or processes that execute on the computing system, for example, as separate threads. While the system and methods described herein may be implemented in software, implementations in hardware or a combination of software and hardware are also possible and contemplated. In the present disclosure, a ‘computing entity’ may be any computing system as previously defined herein, or any module or combination of modules running on a computing system.
In at least some instances, a hardware processor is provided that is operable to carry out executable instructions for performing a method or process, such as the methods and processes disclosed herein. The hardware processor may or may not comprise an element of other hardware, such as the computing devices and systems disclosed herein.
In terms of computing environments, embodiments of the invention may be performed in client-server environments, whether network or local environments, or in any other suitable environment. Suitable operating environments for at least some embodiments of the invention include cloud computing environments where one or more of a client, server, or other machine may reside and operate in a cloud environment.
Any one or more of the entities disclosed, or implied, by the Figures and/or elsewhere herein, may take the form of, or include, or be implemented on, or hosted by, a physical computing device. As well, where any of the aforementioned elements comprise or consist of a virtual machine (VM), that VM may constitute a virtualization of any combination of the physical components disclosed herein.
An example of a physical computing device may include a memory which may include one, some, or all, of random access memory (RAM), non-volatile random access memory (NVRAM), read-only memory (ROM), and persistent memory, one or more hardware processors, non-transitory storage media, UI device, and data storage. One or more of the memory components of the physical computing device may take the form of solid state device (SSD) storage. As well, one or more applications may be provided that comprise instructions executable by one or more hardware processors to perform any of the operations, or portions thereof, disclosed herein.
Such executable instructions may take various forms including, for example, instructions executable to perform any method or portion thereof disclosed herein, and/or executable by/at any of a storage site, whether on-premises at an enterprise, or a cloud storage site, client, datacenter, or backup server, to perform any of the functions disclosed herein. As well, such instructions may be executable to perform any of the other operations and methods, and any portions thereof, disclosed herein.
The present invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.
Number | Date | Country | |
---|---|---|---|
20210365339 A1 | Nov 2021 | US |