The present invention relates to acceleration and virtualization services for virtual machines, such as replication and snapshots.
Data center virtualization technologies are now well adopted into information technology infrastructures. As more and more applications are deployed in a virtualized infrastructure, there is a growing need for performance acceleration, virtualization services, and business continuity at various levels.
Virtual servers are logical entities that run as software in a server virtualization infrastructure, also referred to as a “hypervisor”. A hypervisor provides storage device emulation, also referred to as “virtual disks”, to virtual servers. A hypervisor implements virtual disks using back-end technologies, such as files on a dedicated file system or raw mappings to physical devices.
As distinct from physical servers that run on hardware, virtual servers execute their operating systems within an emulation layer that is provided by a hypervisor. Virtual servers may be implemented in software to perform the same tasks as physical servers. Such tasks include, for example, execution of server applications, such as database applications, customer relationship management (CRM) applications, email servers, and the like. Generally, most applications that are executed on physical servers can be programmed to run on virtual servers. Virtual servers typically run applications that service a large number of clients. As such, virtual servers should provide high performance, high availability, data integrity, and data continuity. Virtual servers are dynamic in the sense that they are easily moved from one physical system to another. On a single physical server, the number of virtual servers may vary over time as virtual machines are added to and removed from the physical server.
Conventional acceleration and virtualization systems are not designed to handle the demands created by the virtualization paradigm. Most conventional systems are not implemented at the hypervisor level to use virtual servers and virtual disks, but instead are implemented at the physical disk level. As such, these conventional systems are not fully virtualization-aware.
Because computing resources, such as CPU and memory, are provided to the virtual server by the hypervisor, the main bottleneck for the virtual server's operation resides in the storage path, and in particular the actual storage media, e.g., the magnetic hard disk drives (HDDs). An HDD is an electromechanical device and as such, performance, especially random access performance, is extremely limited due to rotational and seek latencies. Specifically, any random access READ command requires an actuator movement to position the head over the correct track as part of a seek command, which then incurs additional rotational latencies until the correct sector has moved under the head.
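To put these latencies in perspective, the following back-of-the-envelope calculation, based on typical published drive figures rather than measured data, shows why a single HDD is limited to roughly a hundred random reads per second:

```python
# Back-of-the-envelope HDD random-read limit, using typical (assumed,
# not measured) figures: a 7,200 RPM spindle and ~9 ms average seek.
rpm = 7200
avg_rotational_latency_ms = (60_000 / rpm) / 2   # half a revolution ~= 4.17 ms
avg_seek_ms = 9.0                                # typical desktop-class seek
service_time_ms = avg_seek_ms + avg_rotational_latency_ms
iops = 1000 / service_time_ms                    # ~= 76 random reads per second
print(f"{avg_rotational_latency_ms:.2f} ms rotation, ~{iops:.0f} IOPS")
```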
Another type of storage media is a solid state disk or device (SSD), which uses solid state technology to store information and provides access to the stored information via a storage interface. An SSD may use NAND flash memory to store the data, together with a controller that translates regular storage connectivity (electrical and logical) into flash memory commands (program and erase). Such a controller may use embedded SRAM, additional DRAM memory, battery backup, and other elements.
Flash-based storage devices (or raw flash) are purely electronic devices, and as such do not contain any moving parts. Compared to HDDs, a READ command from a flash device is serviced in an immediate operation, yielding much higher performance, especially in the case of small random access read commands. In addition, the multi-channel architecture of modern NAND flash-based SSDs results in sequential data transfers that saturate most host interfaces.
Because of the higher cost per bit, deployment of solid state drives faces some limitations in general. In the case of NAND flash memory technology, another issue that comes into play is limited data retention. It is not surprising, therefore, that cost and data retention issues along with the limited erase count of flash memory technology are prohibitive for acceptance of flash memory in back-end storage devices. Accordingly, magnetic hard disks still remain the preferred media for the primary storage tier. A commonly used solution, therefore, is to use fast SSDs as cache for inexpensive HDDs.
Because the space in the cache is limited, efficient caching algorithms must make complex decisions about which data to cache and which to bypass. Advanced caching algorithms also require the collection of storage usage statistics over time to make an informed decision on what to cache and when to cache it.
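The sketch below illustrates one such statistics-informed decision; the frequency-based admission threshold, LRU eviction, and all class and function names are illustrative assumptions, not a description of any particular product:

```python
from collections import OrderedDict, defaultdict

class StatAwareCache:
    """Blocks earn admission through repeated reads; eviction is
    least-recently-used. A hypothetical sketch, not a real product."""

    def __init__(self, capacity_blocks, admit_threshold=2):
        self.capacity = capacity_blocks
        self.admit_threshold = admit_threshold
        self.store = OrderedDict()           # block_id -> data, in LRU order
        self.read_counts = defaultdict(int)  # usage statistics over time

    def on_read(self, block_id, fetch_from_disk):
        self.read_counts[block_id] += 1
        if block_id in self.store:                # cache hit
            self.store.move_to_end(block_id)
            return self.store[block_id]
        data = fetch_from_disk(block_id)           # cache miss: go to the HDD
        if self.read_counts[block_id] >= self.admit_threshold:
            if len(self.store) >= self.capacity:
                self.store.popitem(last=False)     # evict the coldest block
            self.store[block_id] = data            # admit frequently read data
        return data
```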
Virtualization services, such as snapshots and remote replication, are available at the storage level or at the application level. For example, the storage can replicate its volumes to storage at a remote site. An application in a virtual server can replicate its necessary data to an application at a remote site. Backup utilities can replicate files from the virtual servers to a remote site. However, acceleration and virtualization services outside the hypervisor environment suffer from inefficiency, lack of coordination between the services, multiple services to manage and recover, and lack of synergy.
An attempt to resolve this inefficiency leads to a unified environment for acceleration and virtualization in the hypervisor. This provides an efficient, simple-to-manage storage solution that adapts dynamically to changing virtual machine storage needs and allows the services to work in synergy. Accordingly, the hypervisor is the preferred environment in which to place the cache, in this case an SSD.
To help with efficient routing of data through hypervisors, hypervisor manufacturers provide hooks in the hypervisor that enable the insertion of filtering code. However, strong limitations on the memory footprint and complexity of the inserted filter code prevent today's caching solutions from placing large amounts of logic inside the hypervisor.
Certain embodiments disclosed herein include a cross-host multi-hypervisor system which includes a plurality of accelerated and, optionally, non-accelerated hosts connected through a communications network and configured to synchronize migration of virtual servers and virtual disks from one accelerated host to another while maintaining coherency of services such as cache, replication, and snapshots. In one embodiment, each host contains at least one virtual server in communication with a virtual disk, wherein the virtual server can read from and write to the virtual disk. In addition, each host has an adaptation layer with an integrated cache layer, which is in communication with the virtual server and intercepts cache commands issued by the virtual server to the virtual disk, wherein the cache commands include, for example, read and write commands.
Each accelerated host further contains a local cache memory, preferably in the form of a flash-based solid state drive. In addition to the non-volatile flash memory tier, a DRAM-based tier may yield even higher performance. The local cache memory is controlled by the cache layer, which governs the transfer of contents, such as data and metadata, from the virtual disks to the local cache memory.
The adaptation layer is further in communication with a Virtualization and Acceleration Server (VXS), which receives the intercepted commands from the adaptation layer for volume replication, volume snapshot, and cache management. The cache layer, which is integrated in the adaptation layer, accelerates the operation of the virtual servers by managing the caching of the virtual disks. In one embodiment, the caching includes transferring data and metadata into the cache tier(s), while the VXS provides replication and snapshot functionality to the virtual servers.
In one embodiment, the contents of any one cache, comprising data and metadata from any virtual disk in any host in the network, can be replicated in the cache of any other host in the network. This allows seamless migration of a virtual disk between any two hosts without incurring a performance hit, since the data are already present in the cache of the destination host.
The VXS further provides cache management and policy enforcement based on workload information. The virtualization and acceleration servers in different hosts are configured to synchronize with each other to enable migration of virtual servers and virtual disks across hosts.
Certain embodiments of the invention further include a hypervisor for accelerating cache operations. The hypervisor comprises at least one virtual server; at least one virtual disk that is read from and written to by the at least one virtual server; an adaptation layer having therein a cache layer and being in communication with the at least one virtual server, wherein the adaptation layer is configured to intercept and cache storage commands issued by the at least one virtual server to the at least one virtual disk; and at least one virtualization and acceleration server (VXS) in communication with the adaptation layer, wherein the VXS is configured to receive the intercepted cache commands from the adaptation layer and to perform at least a volume replication service, a volume snapshot service, and a cache volume service, as well as cache synchronization between a plurality of host sites.
Certain embodiments of the invention further include a method for synchronizing migration of virtual servers across a plurality of host computers communicatively connected through a network, wherein each host computer has at least one virtual server connected to at least one virtual disk, and an adaptation layer in communication with the at least one virtual server and with a virtualization and acceleration server (VXS). The method comprises intercepting, by the adaptation layer, cache commands from the at least one virtual server to the virtual disk; communicating the intercepted cache commands from the adaptation layer to the virtualization and acceleration server; and performing, based on the intercepted cache commands, at least a volume replication service, a volume snapshot service, and a cache volume service, and synchronizing the cache between the plurality of host computers.
The subject matter that is regarded as the invention is particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other objects, features and advantages of the invention will be apparent from the following detailed description taken in conjunction with the accompanying drawings.
The embodiments disclosed herein are only examples of the many possible advantageous uses and implementations of the innovative teachings presented herein. In general, statements made in the specification of the present application do not necessarily limit any of the various claimed inventions. Moreover, some statements may apply to some inventive features but not to others. In general, unless otherwise indicated, singular elements may be in plural and vice versa with no loss of generality. In the drawings, like numerals refer to like parts through the several views.
In the hypervisor 100, the data path establishes a direct connection between a virtual server (e.g., server 110-1) and its respective virtual disk (e.g., 140-1). According to one embodiment, the adaptation layer 130 is located in the data path between the virtual servers 110 and the virtual disks 140-1, 140-n, where every command from a virtual server 110 to any virtual disk passes through the adaptation layer 130.
The VXS 120 is executed as a virtual server and receives data from the adaptation layer 130. The VXS 120 uses its own dedicated virtual disk 143 to store relevant data and metadata (e.g., tables, logs).
The cache memory 150 is connected to the adaptation layer 130 and utilized for acceleration of I/O operations performed by the virtual servers 110 and the VXS 120. The adaptation layer 130 utilizes the higher performance of the cache memory 150 to store frequently used data and fetch them upon request (i.e., to act as a cache).
An exemplary and non-limiting block diagram of the adaptation layer 130 and VXS 120 and their connectivity is illustrated in
In one embodiment, the cache layer 220 can assign RAM as a faster tier (above the flash media 150) to provide a higher level of caching. The cache layer 220 manages the caching of all data in the data path, including data from the virtual servers 110 to the virtual disks 140-1, 140-n and from the VXS 120 to its virtual disk 143. Hence, acceleration is achieved both for the data path between virtual disks and virtual servers and for the virtualization functionality provided by the VXS 120. In another embodiment, the cache layer 220 governs caching of specific virtual disks requiring acceleration, as configured by the user (e.g., a system administrator). In yet another embodiment, the cache layer 220 can differentiate between caching levels via the assignment of resources, thus providing Quality of Service (QoS) for the acceleration.
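The following sketch shows how such a tiered lookup with per-disk quotas might behave; the tier objects, their get/put/used interface, and the quota scheme are illustrative assumptions rather than the actual implementation:

```python
class TieredCacheLayer:
    """Two cache tiers (DRAM above flash) with per-disk QoS quotas.
    The tier objects are hypothetical key-value stores."""

    def __init__(self, ram_tier, flash_tier, quotas):
        self.ram = ram_tier      # faster, smaller tier (DRAM)
        self.flash = flash_tier  # larger tier (SSD media 150)
        self.quotas = quotas     # disk_id -> max cached blocks (QoS)

    def lookup(self, disk_id, block_id):
        data = self.ram.get((disk_id, block_id))
        if data is None:
            data = self.flash.get((disk_id, block_id))
            if data is not None:
                self.ram.put((disk_id, block_id), data)  # promote a hot block
        return data

    def insert(self, disk_id, block_id, data):
        # Only disks configured for acceleration get cache space, and
        # never more than their assigned quota.
        if self.flash.used(disk_id) < self.quotas.get(disk_id, 0):
            self.flash.put((disk_id, block_id), data)
```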
The VXS 120 includes a volume manager 230, a cache manager 240, a replication manager 250, and a snapshot manager 260. The VXS 120 receives data cache commands from the adaptation layer 130. The data cache commands are first processed by the volume manager 230, which dispatches the commands to the appropriate manager according to a priori user configuration settings saved in the configuration module 270. For better flexibility and adaptation to any workload or environment, the user can assign the required functionality for each virtual disk 140-1, 140-n. As noted above, a virtual disk can be referred to as a volume.
The VXS 120 can handle different functionalities which include, but are not limited to, volume replication, volume snapshot, and volume acceleration. Depending on the functionality required for a virtual disk 140-1, 140-n, as defined by the configuration in the module 270, the received data commands are dispatched to the appropriate modules of the VXS 120. These modules include the replication manager 250 for replicating a virtual disk (volume), the snapshot manager 260 for taking and maintaining a snapshot of a virtual disk (volume), and the cache manager 240 for managing cache information (statistics gathering, policy enforcement, etc.) to assist the cache layer 220.
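A minimal sketch of this dispatch is shown below; the configuration format, the command object, and the manager interfaces are assumptions made for illustration:

```python
class VolumeManager:
    """Dispatches intercepted commands per a priori, per-volume settings."""

    def __init__(self, config, replication_mgr, snapshot_mgr, cache_mgr):
        # Hypothetical format: volume id -> set of assigned functionalities,
        # e.g. {"disk-140-1": {"replicate", "snapshot", "accelerate"}}.
        self.config = config
        self.replication_mgr = replication_mgr
        self.snapshot_mgr = snapshot_mgr
        self.cache_mgr = cache_mgr

    def dispatch(self, cmd):
        services = self.config.get(cmd.volume, set())
        if "replicate" in services:
            self.replication_mgr.handle(cmd)   # journal the change
        if "snapshot" in services:
            self.snapshot_mgr.handle(cmd)      # maintain the restore point
        if "accelerate" in services:
            self.cache_mgr.handle(cmd)         # update statistics and policy
```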
The cache manager 240 is also responsible for policy enforcement of the cache layer 220. In one embodiment, the cache manager 240 decides what data to insert into the cache and/or remove from the cache according to an a priori policy that can be set by a user (e.g., an administrator) based on, for example and without limitation, known user activity or records of access patterns. In addition, the cache manager 240 is responsible for gathering statistics and computing a histogram of the data workload in order to profile the workload pattern and detect hot zones therein.
The replication manager 250 replicates a virtual disk (140-1, 140-n) to a remote site over a network, e.g., over a WAN. The replication manager 250 is responsible for recording changes to the virtual disk, storing the changes in a change repository (i.e., a journal), and transmitting the changes to a remote site according to a scheduled policy. The replication manager 250 may further control replication of the cached data and the cache mapping to one or more additional VXS modules on one or more additional physical servers located at a remote site. Thus, the mapping may co-exist on a collection of servers, allowing transfer or migration of the virtual servers between physical systems while maintaining acceleration of the virtual servers. The snapshot manager 260 takes and maintains snapshots of the virtual disks 140-1, 140-n, which are restore points that allow restoring the virtual disks to each snapshot.
An exemplary and non-limiting flowchart 300 describing the handling of a read command issued by a virtual server to a virtual disk is shown in
If S310 returns a No answer, i.e., the data requested in the command do not reside in the cache, the received read command is passed, at S330, to the virtual disk via the IO layer and in parallel, at S350, to the VXS 120 for statistical analysis.
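A minimal sketch of this read path follows. The hit path is not recited in the steps above and is assumed here to serve the request directly from the cache; all helper names are illustrative:

```python
def handle_read(cmd, cache, io_layer, vxs):
    data = cache.lookup(cmd.volume, cmd.lba)     # S310: is the data cached?
    if data is not None:
        return data              # assumed hit path: serve from cache memory
    io_layer.forward(cmd)        # S330: pass the command to the virtual disk
    vxs.submit(cmd)              # S350: in parallel, to the VXS for statistics
    return None                  # the data arrive later via the read callback
```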
An exemplary and non-limiting flowchart 400 for handling of a read callback when data to a read command are returned from the virtual disk to the virtual server is shown in
At S510, it is checked whether the data to be written, as designated in the write command, reside in the cache memory 150. If so, at S520 the respective cached data are invalidated. After the invalidation, or if it was not required, the write command is sent, at S530, through the IO layer 180 to the physical disk 160 and, at S540, to the VXS 120 for processing and update of the virtual disks 140. A write command is processed in the VXS 120 according to the configuration saved in the configuration module 270. As noted above, such processing may include, but is not limited to, data replication, snapshot, and caching of the data.
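A corresponding sketch of this write path (steps S510 through S540) is shown below, again with illustrative helper names:

```python
def handle_write(cmd, cache, io_layer, vxs):
    if cache.contains(cmd.volume, cmd.lba):      # S510: cached copy present?
        cache.invalidate(cmd.volume, cmd.lba)    # S520: drop the stale data
    io_layer.forward(cmd)   # S530: write through to the physical disk
    vxs.submit(cmd)         # S540: VXS applies replication/snapshot/caching
                            #       per the configuration module 270
```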
An exemplary and non-limiting flowchart 600 illustrating the operation of the replication manager 250 is shown in
The execution reaches S620, where a virtual volume is replicated by the replication manager 250. The virtual volume is in one of the virtual disks 140-1, 140-n assigned to the virtual server from which the command is received. At S630, the replication manager 250 saves changes made to the virtual volume in a change repository (not shown) that resides in the virtual disk 143 of the VXS 120. In addition, the replication manager 250 updates the mapping tables and the metadata in the change repository. In one embodiment, at S640, at a pre-configured schedule, e.g., every day at 12:00 PM, a scheduled replication is performed to send the data changes aggregated in the change repository to a remote site over the network, e.g., a WAN.
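A minimal sketch of this journal-and-ship scheme follows; the journal format, the transport, and the scheduling mechanism are illustrative assumptions:

```python
import time

class ReplicationManager:
    """Records volume changes (S630) and ships them on a schedule (S640)."""

    def __init__(self, remote_link, flush_interval_s=24 * 3600):
        self.journal = []                 # change repository on virtual disk 143
        self.remote_link = remote_link    # hypothetical WAN link to the remote site
        self.flush_interval_s = flush_interval_s
        self.last_flush = time.time()

    def handle(self, cmd):
        if cmd.op == "write":             # only writes change the volume
            self.journal.append((cmd.volume, cmd.lba, cmd.data))

    def maybe_flush(self):
        # On the pre-configured schedule, send the aggregated changes.
        if time.time() - self.last_flush >= self.flush_interval_s:
            self.remote_link.send(self.journal)
            self.journal.clear()
            self.last_flush = time.time()
```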
An exemplary and non-limiting flowchart 700 illustrating the operation of the snapshot manager 260 is shown in
At S720, the command reaches the snapshot manager 260 when the volume, i.e., one of the virtual disks, is a snapshot volume. At S730, the snapshot manager 260 saves changes to the volume and updates the mapping tables (if necessary) in the snapshot repository in the virtual disk 143 of the VXS 120.
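A sketch of the snapshot bookkeeping follows. Since the exact mechanism is not recited, a copy-on-write scheme is assumed here, in which a block's prior contents are preserved in the repository before its first overwrite:

```python
class SnapshotManager:
    """Copy-on-write bookkeeping (assumed) for snapshot volumes."""

    def __init__(self, repository):
        self.repository = repository   # resides on the VXS virtual disk 143
        self.mapping = {}              # (volume, lba) -> location in repository

    def handle(self, cmd, base_disk):
        if cmd.op != "write":
            return
        key = (cmd.volume, cmd.lba)
        if key not in self.mapping:                 # only the first overwrite
            old = base_disk.read(cmd.volume, cmd.lba)
            self.mapping[key] = self.repository.append(old)   # S730

    def read_snapshot(self, volume, lba, base_disk):
        # Restore-point read: preserved blocks come from the repository,
        # untouched blocks from the base virtual disk.
        loc = self.mapping.get((volume, lba))
        if loc is not None:
            return self.repository.read(loc)
        return base_disk.read(volume, lba)
```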
An exemplary and non-limiting flowchart 800 illustrating the operation of the cache manager 240 is shown in
At S820, the received command reaches the cache manager 240. At S830, the cache manager 240 updates its internal cache statistics, for example, cache hits, cache misses, histograms, and so on. At S840, the cache manager 240 recalculates and updates its hot zone mapping every time period (e.g., every minute). More specifically, in every predefined time period or interval in which the data are not accessed, their temperature decreases, and, on any new access, the temperature increases again. The different data temperatures can be mapped as zones, for example on a scale from 1 to 10, but any other granularity is possible. Then, at S850, the cache manager 240 updates its application-specific policies. For example, in an office environment, a list of frequently requested documents can be maintained and converted into a caching policy for the specific application, which is updated every time a document is accessed.
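The sketch below illustrates such temperature tracking; the per-minute interval and the 1-to-10 scale follow the examples above, while the decay step, cap, and hot-zone threshold are assumptions:

```python
import time
from collections import defaultdict

class HotZoneTracker:
    """Zone temperatures rise on access and cool down each idle interval."""

    def __init__(self, interval_s=60, max_temp=10):
        self.interval_s = interval_s
        self.max_temp = max_temp
        self.temps = defaultdict(int)   # zone_id -> temperature (0..max_temp)
        self.last_decay = time.time()

    def on_access(self, zone_id):
        self.temps[zone_id] = min(self.temps[zone_id] + 1, self.max_temp)

    def decay(self):
        # Each interval without access lowers a zone's temperature by one.
        if time.time() - self.last_decay >= self.interval_s:
            for zone in list(self.temps):
                self.temps[zone] = max(self.temps[zone] - 1, 0)
            self.last_decay = time.time()

    def hot_zones(self, threshold=7):
        return [z for z, t in self.temps.items() if t >= threshold]
```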
According to one embodiment, in a plurality of accelerated hosts, the VXS units in each accelerated host communicate with each other to achieve synchronization of configurations and to enable migration of virtual servers and virtual disks from one host to another. The host may be an accelerated host or a non-accelerated host. That is, the synchronization of configurations may be performed from an accelerated host to a non-accelerated host, or vice versa. As noted above, each accelerated host also includes a local cache memory, preferably in the form of a flash-based solid state drive. In addition to the non-volatile flash memory tier, a DRAM-based tier may yield even higher performance. The local cache memory is controlled by the cache layer, which governs the transfer of contents, such as data and metadata, from the virtual disks to the local cache memory.
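The sketch below illustrates shipping a virtual server's cache state to a destination host ahead of a migration, so that the destination starts with a warm cache; the peer API and message format are illustrative assumptions:

```python
class VxsPeerSync:
    """Synchronizes configurations and cache state between VXS units."""

    def __init__(self, peers):
        self.peers = peers   # VXS units on the other hosts

    def sync_config(self, config):
        # Keep per-volume configurations coherent across all hosts.
        for peer in self.peers:
            peer.receive_config(config)

    def prepare_migration(self, vm_id, cache, destination):
        # Ship the virtual server's cache mapping and cached blocks so the
        # destination host serves hits immediately after the move.
        payload = {
            "vm": vm_id,
            "mapping": cache.mapping_for(vm_id),   # cache metadata
            "blocks": cache.blocks_for(vm_id),     # the cached data itself
        }
        destination.receive_cache_state(payload)
```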
According to another embodiment, the hosts 100-A and 100-B can also share the same virtual disk, thus achieving data synchronization via the hypervisor cluster mechanism.
The foregoing detailed description has set forth a few of the many forms that the invention can take. It is intended that the foregoing detailed description be understood as an illustration of selected forms that the invention can take and not as a limitation as to the definition of the invention.
Most preferably, the embodiments described herein can be implemented as any combination of hardware, firmware, and software. Moreover, the software is preferably implemented as an application program tangibly embodied on a program storage unit or computer readable medium. The application program may be uploaded to, and executed by, a machine comprising any suitable architecture. Preferably, the machine is implemented on a computer platform having hardware such as one or more central processing units (“CPUs”), a memory, and input/output interfaces. The computer platform may also include an operating system and microinstruction code. The various processes and functions described herein may be either part of the microinstruction code or part of the application program, or any combination thereof, which may be executed by a CPU, whether or not such computer or processor is explicitly shown. In addition, various other peripheral units may be connected to the computer platform such as an additional data storage unit and a printing unit. Furthermore, a non-transitory computer readable medium is any computer readable medium except for a transitory propagating signal.
This application claims the benefit of U.S. provisional application No. 61/555,145 filed Nov. 3, 2011, the contents of which are herein incorporated by reference.