DATA PATH STRATEGIES FOR REPLICA VOLUMES

Information

  • Patent Application
  • Publication Number
    20240419603
  • Date Filed
    June 13, 2023
  • Date Published
    December 19, 2024
Abstract
Techniques are disclosed that provide data path strategies for improving storage performance at DR sites. The techniques include receiving, in an asynchronous replication process, a large replication data transfer including data changes of a production volume since the most recent synchronization to a replica volume, partitioning the replication data into multiple small write requests, tagging each small write request as a write request to the replica volume, and performing one or more of: early eviction, from cache memory, of all cache pages used to cache host data specified in the small write requests; deep compression of contiguous host data specified in the small write requests; stream separation on the small write requests, each small write request being tagged as corresponding to a specific production site; and flushing of host data having the same retention period to a specific region of physical storage space for the replica volume, each small write request being tagged with hint information indicating the retention period.
Description
BACKGROUND

Storage systems include processing circuitries and storage arrays containing storage devices such as solid-state drives (SSDs), flash drives, and/or hard disk drives (HDDs). The processing circuitries perform input/output (IO) operations in response to storage IO requests (e.g., read requests, write requests) issued over a network by host computers coupled to the storage systems. The IO operations (e.g., read operations, write operations) cause host data including data blocks, data pages, data files, or other data elements specified in the storage IO requests to be read from or written to volumes (VOLs), logical units (LUs), filesystems, or other storage objects or resources stored on the storage devices. To provide backup or remote storage of host data, the storage systems perform asynchronous replication to replicate or write the host data on production volumes stored at a production site to replica volumes stored at a disaster recovery (DR) site.


SUMMARY

A storage system can perform asynchronous replication of host data on production volumes based on a recovery point objective (RPO), which can represent a maximum amount of data that a user of the storage system would be willing to lose in the event of a failure or disaster at a production site where the production volumes are stored. The RPO can determine a minimum frequency of asynchronous replication, which can be represented by a specified RPO interval such as 5 minutes, 15 minutes, or any other suitable interval. During periods between successive synchronizations, the storage system may make changes to the host data on the production volumes. The storage system can replicate or write some or all changes made to the host data since the most recent synchronization to replica volumes stored at a DR site, in accordance with the specified RPO interval, thereby generating, for each RPO interval, consistent snapshots of the replica volumes. The host data can be read from the replica volumes when a disaster recovery test or an actual disaster recovery is performed at the DR site, or when a failover is performed due to a failure or planned maintenance at the production site. However, host data typically tends to be written to such replica volumes far more often than it is read from them. If any of the snapshots of the replica volumes are not used within a certain period, then they can be unmapped and deleted from storage at the DR site.


Techniques are disclosed herein that provide data path strategies for improving the performance and/or efficiency of storage systems deployed at DR sites. The disclosed techniques can be practiced in a storage environment that includes a source storage system (or “source node”) deployed at a production site and a destination storage system (or “destination node”) deployed at a DR site, which can be a mixed-use DR site configured to store production volumes as well as replica volumes. In the disclosed techniques, the source node can perform asynchronous replication of host data on a production volume stored at the production site, replicating or writing some or all changes made to the host data at specific offsets of the production volume since the most recent synchronization to a replica volume stored at the DR site, in accordance with a specified RPO interval. The disclosed techniques can include reading the data changes at the specific offsets of the production volume, accumulating the data changes for a “large” (e.g., a 512 kilobyte (Kb)) replication data transfer from the source node at the production site to the destination node at the DR site. In the disclosed techniques, the large replication data transfer can include multiple “small” write requests of various sizes (e.g., 4 Kb, 16 Kb, 64 Kb), some of which can be logically contiguous based on offset. The disclosed techniques can include, upon receipt of the large replication data transfer at the destination node, partitioning it into a plurality of small write requests, keeping any logically contiguous data together as part of the same write request; tagging each small write request as a write request to the replica volume; and, in response to each tagged small write request, performing a write operation to write a data change at a specific offset to the replica volume.
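
For illustration only, the following Python sketch shows one way the partitioning step described above might be modeled, coalescing logically contiguous chunks into the same tagged write request. The class names (ReplicationChunk, WriteRequest), the tag dictionary, and the 64 Kb size limit are assumptions introduced for this sketch and are not taken from the disclosure.

    from dataclasses import dataclass, field
    from typing import Dict, List

    MAX_WRITE_SIZE = 64 * 1024  # hypothetical upper bound on a "small" write request

    @dataclass
    class ReplicationChunk:
        offset: int   # byte offset of the changed data within the volume
        data: bytes   # the changed host data at that offset

    @dataclass
    class WriteRequest:
        offset: int
        data: bytes
        tags: Dict[str, str] = field(default_factory=lambda: {"target_type": "replica"})

    def partition_large_transfer(chunks: List[ReplicationChunk]) -> List[WriteRequest]:
        """Split a large replication transfer into small, replica-tagged write
        requests, coalescing logically contiguous chunks into the same request."""
        requests: List[WriteRequest] = []
        for chunk in sorted(chunks, key=lambda c: c.offset):
            last = requests[-1] if requests else None
            if (last is not None
                    and last.offset + len(last.data) == chunk.offset
                    and len(last.data) + len(chunk.data) <= MAX_WRITE_SIZE):
                last.data += chunk.data   # contiguous with the previous request: keep together
            else:
                requests.append(WriteRequest(chunk.offset, chunk.data))
        return requests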


In certain asynchronous replication scenarios, the disclosed techniques can leverage host data being mostly written to, rather than read from, replica volumes stored at a mixed-use DR site. In one scenario, each write operation performed in response to a tagged small write request can write a data change at a specific offset of a replica volume to a cache page in cache memory of a destination node. The disclosed techniques can include flushing data written to the cache page from the cache memory to the replica volume, early evicting the cache page from the cache memory, and either returning the cache page to a free page list of the cache memory or placing the cache page at the head of a least recently used (LRU) list of the cache memory. In this way, IO operations directed to production volumes stored at the mixed-use DR site can benefit from such cache pages being freed up sooner.
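
A minimal sketch of the early-eviction behavior described above, assuming a toy cache keyed by (volume, offset); the class and method names, and the use of Python deques for the free page list and LRU list, are illustrative assumptions rather than the disclosed implementation.

    from collections import deque

    class ReplicaAwareCache:
        """Toy cache model: pages holding replica-bound data are evicted as soon as
        they are flushed, and either freed outright or made the next LRU victim."""

        def __init__(self):
            self.pages = {}           # (volume, offset) -> cached data
            self.targets = {}         # (volume, offset) -> "replica" or "production"
            self.free_list = deque()  # pages immediately available for reuse
            self.lru = deque()        # index 0 = head = next page to be reclaimed

        def write(self, volume, offset, data, target_type):
            self.pages[(volume, offset)] = data
            self.targets[(volume, offset)] = target_type

        def flush(self, volume, offset, persist, use_free_list=True):
            key = (volume, offset)
            persist(volume, offset, self.pages[key])   # de-stage to backend storage
            if self.targets[key] == "replica":
                # Early eviction: replica data is unlikely to be read again.
                del self.pages[key]
                del self.targets[key]
                if use_free_list:
                    self.free_list.append(key)         # return to the free page list
                else:
                    self.lru.appendleft(key)           # place at the head of the LRU list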


In another scenario, a large replication data transfer to a replica volume can specify a large range or amount (e.g., up to 512 Kb or more) of contiguous host data. The disclosed techniques can include performing deep compression on the contiguous host data before flushing the host data to the replica volume. Because host data is mostly written to, rather than read from, replica volumes stored at the DR site, any read penalty resulting from performing such deep compression on contiguous host data is likely to be low.


In still another scenario, in addition to being tagged as a write request to a replica volume, each small write request to the replica volume can also be tagged to identify or indicate a production site where a corresponding source node is deployed. In this scenario, the production site may be one of several production sites, each of which mandates a different class of service (CoS) for providing backup or remote storage of host data on its production volumes. The disclosed techniques can include, upon receipt of a large replication data transfer at a destination node, partitioning it into a plurality of small write requests, tagging each small write request as a write request to a replica volume, and further tagging the small write request as corresponding to a specific production site. The disclosed techniques can include performing stream separation on a plurality of such multi-tagged small write requests based at least on specific production sites identified or indicated in the respective write requests, and, for each resulting stream of small write request transactions, performing write operations to write data changes at specific offsets to the replica volume for subsequent storage to a storage tier that conforms to the CoS mandated by the specific production site.


In yet another scenario, each large transfer of replication data can be tagged with hint information pertaining to a retention period for host data to be written to a replica volume at the DR site. The disclosed techniques can include reading changes made to host data at specific offsets of a production volume since the most recent synchronization to the replica volume, accumulating the data changes in a large replication data transfer to the replica volume, tagging the large transfer of replication data with hint information pertaining to a retention period for the accumulated data changes, and sending the tagged large transfer of replication data from a source node to a destination node. The disclosed techniques can include, upon receipt of the tagged large transfer of replication data at the destination node, partitioning it into a plurality of small write requests, tagging each small write request as a write request to the replica volume, and further tagging the small write request with the hint information pertaining to the retention period of the host data. The disclosed techniques can include flushing host data having the same retention period to a specific region of physical storage space (e.g., a physical large block (PLB)) for the replica volume. Because host data having the same retention period can be flushed to the same PLB for the replica volume, subsequent deletion of the host data from the PLB at the expiration of the retention period can be more efficiently performed.


By receiving, in an asynchronous replication process, a large transfer of replication data including accumulated changes made to data of a production volume since the most recent synchronization to a replica volume, partitioning the large transfer of replication data into a plurality of small write requests, tagging each small write request as a write request to the replica volume, and, in response to servicing the plurality of small write requests, performing one or more of (i) early evicting, from cache memory, all cache pages used to cache host data specified in the plurality of small write requests, (ii) deep compression of contiguous host data specified in the plurality of small write requests, (iii) stream separation on the plurality of small write requests, each small write request being further tagged as corresponding to a specific production site, and (iv) flushing host data having the same retention period to a specific region of physical storage space for the replica volume, each small write request being further tagged with hint information pertaining to the retention period, the performance and/or efficiency of storage systems deployed at DR sites can be improved.


In certain embodiments, a method of improving performance and/or efficiency of storage systems deployed at disaster recovery (DR) sites includes receiving, at a destination node of a DR site from a source node at a production site, a large transfer of replication data including accumulated changes made to data of a production volume since a most recent synchronization to a replica volume in an asynchronous replication process; partitioning the large transfer of replication data into a plurality of small write requests; tagging each small write request as a write request to the replica volume; and, in response to servicing the plurality of small write requests, performing one or more of (i) early evicting, from cache memory, all cache pages used to cache host data specified in the plurality of small write requests, (ii) deep compression of contiguous host data specified in the plurality of small write requests, (iii) stream separation on the plurality of small write requests, each small write request being further tagged as corresponding to a specific production site, and (iv) flushing host data having the same retention period to a specific region of physical storage space for the replica volume, each small write request being further tagged with hint information pertaining to the retention period.


In certain arrangements, the method includes writing the changes made to data of the replica volume to a cache page in the cache memory.


In certain arrangements, the changes made to the data written to the cache page are marked as being targeted to the replica volume. The method includes flushing the marked changes from the cache memory to the replica volume, and evicting the cache page from the cache memory in response to flushing the marked changes from the cache memory to the replica volume.


In certain arrangements, the method includes returning the cache page to a free page list of the cache memory.


In certain arrangements, the method includes placing the cache page at a head of a least recently used (LRU) list of the cache memory.


In certain arrangements, the method includes performing inline deep compression of the contiguous host data at the DR site.


In certain arrangements, the method includes writing a compressed version of the host data to a cache page in the cache memory.


In certain arrangements, the method includes flushing the compressed version of the host data from the cache memory to the replica volume.


In certain arrangements, the method includes tagging each small write request with a production site identifier (ID) identifying the production site where the source node is deployed.


In certain arrangements, the method includes tagging each small write request with the production site ID.


In certain arrangements, the method includes performing inline stream separation on the plurality of small write requests based at least on the production site ID.


In certain arrangements, the production site mandates a specific class of service (CoS) for providing backup or remote storage of host data. The method includes, for each stream of small write request transactions resulting from performing the stream separation, performing write operations to write, to the replica volume, the changes made to the data of the replica volume for subsequent storage to a storage tier that conforms to the specific CoS mandated by the production site.


In certain arrangements, the large transfer of replication data is tagged with hint information pertaining to a retention period for the host data. The method includes tagging each small write request with the hint information pertaining to the retention period for the host data.


In certain arrangements, the method includes flushing the host data having the same retention period to a specific region of physical storage space for the replica volume.


In certain embodiments, a system for improving performance and/or efficiency of storage systems deployed at disaster recovery (DR) sites includes a memory and processing circuitry configured to execute program instructions out of the memory to receive, at a destination node of a DR site from a source node at a production site, a large transfer of replication data including accumulated changes made to data of a production volume since a most recent synchronization to a replica volume in an asynchronous replication process, partition the large transfer of replication data into a plurality of small write requests, tag each small write request as a write request to the replica volume, and, in response to servicing the plurality of small write requests, perform one or more of (i) early evicting, from cache memory, all cache pages used to cache host data specified in the plurality of small write requests, (ii) deep compression of contiguous host data specified in the plurality of small write requests, (iii) stream separation on the plurality of small write requests, each small write request being further tagged as corresponding to a specific production site, and (iv) flushing host data having the same retention period to a specific region of physical storage space for the replica volume, each small write request being further tagged with hint information pertaining to the retention period.


In certain arrangements, the processing circuitry is configured to execute the program instructions out of the memory to write the changes made to data of the replica volume to a cache page in the cache memory, and perform one of (i) returning the cache page to a free page list of the cache memory, and (ii) placing the cache page at a head of a least recently used (LRU) list of the cache memory.


In certain arrangements, the processing circuitry is configured to execute the program instructions out of the memory to perform inline deep compression of the contiguous host data at the DR site.


In certain arrangements, the processing circuitry is configured to execute the program instructions out of the memory to tag each small write request with a production site identifier (ID) identifying the production site where the source node is deployed, and perform inline stream separation on the plurality of small write requests based at least on the production site ID.


In certain arrangements, the large transfer of replication data is tagged with hint information pertaining to a retention period for the host data, and the processing circuitry is configured to execute the program instructions out of the memory to tag each small write request with the hint information pertaining to the retention period for the host data, and flush the host data having the same retention period to a specific region of physical storage space for the replica volume.


In certain embodiments, a computer program product includes a set of non-transitory, computer-readable media having instructions that, when executed by processing circuitry, cause the processing circuitry to perform a method that includes receiving, at a destination node of a DR site from a source node at a production site, a large transfer of replication data including accumulated changes made to data of a production volume since a most recent synchronization to a replica volume in an asynchronous replication process, partitioning the large transfer of replication data into a plurality of small write requests, tagging each small write request as a write request to the replica volume, and, in response to servicing the plurality of small write requests, performing one or more of (i) early evicting, from cache memory, all cache pages used to cache host data specified in the plurality of small write requests, (ii) deep compression of contiguous host data specified in the plurality of small write requests, (iii) stream separation on the plurality of small write requests, each small write request being further tagged as corresponding to a specific production site, and (iv) flushing host data having the same retention period to a specific region of physical storage space for the replica volume, each small write request being further tagged with hint information pertaining to the retention period.


Other features, functions, and aspects of the present disclosure will be evident from the Detailed Description that follows.





BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing and other objects, features, and advantages will be apparent from the following description of particular embodiments of the present disclosure, as illustrated in the accompanying drawings, in which like reference characters refer to the same parts throughout the different views.



FIG. 1 is a block diagram of an exemplary storage environment, in which techniques can be practiced for providing data path strategies that can improve the performance and/or efficiency of storage systems deployed at disaster recovery (DR) sites;



FIG. 2 is a block diagram of an exemplary storage node that can be included in a storage system within the storage environment of FIG. 1;



FIG. 3a is a block diagram of an exemplary hardware/software stack that can be included in a data path of the storage node of FIG. 2;



FIG. 3b is a block diagram of exemplary layered services that can be included in the hardware/software stack of FIG. 3a;



FIG. 4 is a block diagram of an exemplary small write request that can be generated by the storage node of FIG. 2; and



FIG. 5 is a flow diagram of an exemplary method of a data path strategy for improving the performance and/or efficiency of storage systems deployed at DR sites.





DETAILED DESCRIPTION

Techniques are disclosed herein that provide data path strategies for improving the performance and/or efficiency of storage systems deployed at disaster recovery (DR) sites. The disclosed techniques can include receiving, in an asynchronous replication process, a “large” transfer of replication data including accumulated changes made to data of a production volume since the most recent synchronization to a replica volume, partitioning the large transfer of replication data into a plurality of “small” write requests, and tagging each small write request as a write request to the replica volume. The disclosed techniques can further include, in response to servicing the plurality of small write requests, performing one or more of (i) early evicting, from cache memory, all cache pages used to cache host data specified in the plurality of small write requests, (ii) deep compression of contiguous host data specified in the plurality of small write requests, (iii) stream separation on the plurality of small write requests, each small write request being further tagged as corresponding to a specific production site, and (iv) flushing host data having the same retention period to a specific region of physical storage space for the replica volume, each small write request being further tagged with hint information pertaining to the retention period. In this way, the performance and/or efficiency of storage systems deployed at DR sites can be improved.



FIG. 1 depicts an illustrative embodiment of an exemplary storage environment 100, in which techniques can be practiced for providing data path strategies that can improve the performance and/or efficiency of storage systems deployed at DR sites. As shown in FIG. 1, the storage environment 100 can include a plurality of host computers 102, a source storage system (or “source node”) 110A, a destination storage system (or “destination node”) 110B, and a communications medium 103 that includes at least one network 106. For example, each of the plurality of host computers 102 may be configured as a web server computer, a file server computer, an email server computer, an enterprise server computer, and/or any other suitable client/server computer or computerized device. The plurality of host computers 102 can be configured to provide, over the network(s) 106, storage input/output (IO) requests (e.g., small computer system interface (SCSI) commands, network file system (NFS) commands) to the source node 110A and/or the destination node 110B. For example, each storage IO request (e.g., read request, write request) may direct the source node 110A and/or the destination node 110B to read or write data blocks, data pages, data files, and/or any other suitable data elements (also referred to herein as “host data”) to/from volumes (VOLs), virtual volumes (VVOLs) (e.g., VMware® VVOLs), logical units (LUs or LUNs), filesystems, directories, files, and/or any other suitable storage objects maintained in association with the source node 110A and/or the destination node 110B.


In one embodiment, the source node 110A can be deployed at a production site where at least one production volume 114A is stored, and the destination node 110B can be deployed at a disaster recovery (DR) site where at least one production volume 116, as well as at least one replica volume 114B, are stored. As such, the DR site is referred to herein as a mixed-use DR site, which can store both production volumes and replica volumes. In this embodiment, the replica volume 114B can be obtained at the DR site in an asynchronous replication process. Such an asynchronous replication process can include, in response to a write request issued to the source node 110A by one of the host computers 102, performing a write operation to write host data to the production volume 114A, acknowledging completion of the write operation to the host computer 102, and, having acknowledged the completion of the write operation, asynchronously performing a large transfer of replication data to the replica volume 114B.
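
The ordering of operations in such an asynchronous replication process can be sketched as follows; this is a toy Python model in which a volume is a dictionary of offsets to data, and the names handle_host_write, replicate, and send_large_transfer are hypothetical, not APIs of the disclosed system.

    class SourceNode:
        """Toy model of the asynchronous replication write path at the source node:
        the host is acknowledged as soon as the production volume is updated, and the
        accumulated changes are shipped later, once per RPO interval."""

        def __init__(self, send_large_transfer):
            self.production_volume = {}    # offset -> data
            self.pending_changes = {}      # changes made since the last synchronization
            self.send_large_transfer = send_large_transfer

        def handle_host_write(self, offset, data, ack):
            self.production_volume[offset] = data   # 1. perform the write operation
            self.pending_changes[offset] = data     # 2. remember the change for replication
            ack()                                   # 3. acknowledge completion to the host

        def replicate(self):
            """Called once per RPO interval: asynchronously transfer the changes."""
            changes, self.pending_changes = self.pending_changes, {}
            if changes:
                self.send_large_transfer(changes)   # large transfer toward the replica volume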



FIG. 2 depicts an exemplary configuration of a storage node 110A/B. It is noted that each of the source node 110A and the destination node 110B (see FIG. 1) can be configured like the storage node 110A/B (see FIG. 2). As shown in FIG. 2, the storage node 110A/B can include a communications interface 202, processing circuitry 204, a memory 206, device interfaces 208, storage devices 210, and/or any other suitable storage node component(s). The communications interface 202 can include an InfiniBand interface, an Ethernet interface, an IEEE 802.11x (WiFi) interface, a Bluetooth interface, and/or any other suitable communications interface. The communications interface 202 can further include SCSI target adapters, network interface adapters, and/or any other suitable adapters for converting electronic, optical, and/or wireless signals received over the network 106 to a form suitable for use by the processing circuitry 204.


The memory 206 can include non-persistent memory (e.g., dynamic random-access memory (DRAM), static random-access memory (SRAM)) and/or persistent memory (e.g., flash memory, magnetic memory). The memory 206 can be configured to store a variety of software constructs including a mapper module (or “mapper”) 214, data path processing components 216 (see also FIG. 3a), and other specialized code and data 218 (e.g., deep compression code and data, stream separation code and data), each of which can be executed by the processing circuitry 204 as program instructions within an operating system 212 to carry out the techniques disclosed herein. For example, the operating system (OS) 212 may be implemented as a Linux OS, Unix OS, Windows OS, or any other suitable operating system. The mapper 214 can employ a tree structure for mapping host data of VOLs, VVOLs, LUs, LUNs, filesystems, directories, and/or files to the storage devices 210 (e.g., SSDs, flash drives, HDDs). In one embodiment, the tree structure can be configured as a B-tree structure that includes multiple levels for accommodating root pages, top pages, middle (or “mid”) pages, leaf pages, virtual large blocks (VLBs), and physical large blocks (PLBs). The root pages can be configured to provide a logical address space with pointers to respective ones of the top pages, which can be configured with pointers to respective ones of the mid-pages. Further, the mid-pages can be configured with pointers to respective ones of the leaf pages, which can be configured with pointers to the VLBs. The VLBs can include reference counts, data compression maps, and/or accounting information for the PLBs, each of which can be configured to provide a two (2) megabyte (Mb) physical space for storing the host data.
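
For illustration, the pointer chain from root pages down to 2 Mb PLBs described above might be modeled as the following Python dataclasses; the dictionary keys (indexing each level by logical offset) and the field names are assumptions made for the sketch, not the mapper's actual layout.

    from dataclasses import dataclass, field
    from typing import Dict, List

    PLB_SIZE = 2 * 1024 * 1024   # each physical large block provides 2 Mb of physical space

    @dataclass
    class PLB:
        data: bytearray = field(default_factory=lambda: bytearray(PLB_SIZE))

    @dataclass
    class VLB:
        plbs: List[PLB] = field(default_factory=list)        # pointers to physical large blocks
        ref_counts: List[int] = field(default_factory=list)  # plus compression maps / accounting

    @dataclass
    class LeafPage:
        vlbs: Dict[int, VLB] = field(default_factory=dict)

    @dataclass
    class MidPage:
        leaves: Dict[int, LeafPage] = field(default_factory=dict)

    @dataclass
    class TopPage:
        mids: Dict[int, MidPage] = field(default_factory=dict)

    @dataclass
    class RootPage:
        tops: Dict[int, TopPage] = field(default_factory=dict)  # the logical address space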


The processing circuitry 204 can include one or more physical processors and/or engines configured to execute the software constructs (e.g., the OS 212, the mapper 214, the data path processing components 216, the specialized code and data 218) stored in the memory 206, as well as data movers, director boards, blades, IO modules, drive controllers, switches, and/or any other suitable computer hardware or combination thereof. For example, the processing circuitry 204 may execute the program instructions out of the memory 206, process storage IO requests (e.g., read requests, write requests) from the host computers 102, and store host data to the storage devices 210 within the storage environment 100, which can be a clustered RAID environment.


The device interfaces 208 can be configured to facilitate data transfers to/from the storage devices 210. The device interfaces 208 can include device interface modules such as disk adapters, disk controllers, or other backend components configured to interface with the physical storage devices 210 (e.g., SSDs, flash drives, HDDs). The device interfaces 208 can be configured to perform data operations using a RAM cache included in the memory 206 when communicating with the storage devices 210, which can be incorporated into a storage array.


In the context of the processing circuitry 204 being configured to execute the software constructs (e.g., the mapper 214, the data path processing components 216, the specialized code and data 218) as program instructions out of the memory 206, a computer program product can be configured to deliver all or a portion of the program instructions to the processing circuitry 204. Such a computer program product can include one or more non-transient computer-readable storage media, such as a magnetic disk, a magnetic tape, a compact disk (CD), a digital versatile disk (DVD), an optical disk, a flash drive, a solid state drive (SSD), a secure digital (SD) chip or device, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), and so on. The non-transient computer-readable storage media can be encoded with sets of program instructions for performing, when executed by the processing circuitry 204, the various techniques disclosed herein.


During operation, the disclosed techniques can provide data path strategies for improving the performance and/or efficiency of storage systems deployed at disaster recovery (DR) sites, such as the destination node 110B deployed at the mixed-use DR site. The disclosed techniques can include performing an asynchronous replication process to replicate host data on the production volume 114A stored at the production site, writing some or all changes made to the host data at specific offsets of the production volume 114A since the most recent synchronization to the replica volume 114B stored at the DR site, in accordance with a specified recovery point objective (RPO) interval. The disclosed techniques can include, at the source node 110A, reading the data changes at the specific offsets of the production volume 114A, accumulating the data changes for a large (e.g., a 512 kilobyte (Kb)) replication data transfer from the source node 110A at the production site over the communication path 112 to the destination node 110B at the DR site. For example, the large replication data transfer may specify a 16 Kb chunk at a first offset, a 4 Kb chunk at a second offset, a 64 Kb chunk at a third offset, and so on. Further, some of the data chunks specified in the large replication data transfer may be logically contiguous based on offset. The disclosed techniques can include, upon receipt of the large replication data transfer at the destination node 110B, partitioning it into a plurality of small (e.g., 4 Kb, 16 Kb, 64 Kb, and so on) write requests, keeping any logically contiguous data together as part of the same write request; tagging each small write request as a write request to the replica volume 114B; and, in response to each tagged small write request, performing a write operation to write a data change at a specific offset to the replica volume 114B. As set forth below with reference to an illustrative example, in certain asynchronous replication scenarios, the disclosed techniques can improve the performance and/or efficiency of the destination node 110B deployed at the DR site by leveraging host data being mostly written to, rather than read from, replica volumes stored at the DR site.


The disclosed techniques will be further understood with reference to the following illustrative example and FIGS. 1, 2, 3a, 3b, and 4. In this example, it is assumed that the source node 110A (see FIG. 1) has sent, to the destination node 110B (see FIG. 1), a large transfer of replication data, which can have a size of 512 Kb or any other suitable size. It is further assumed that the large transfer of replication data includes changes made to host data at specific offsets of the production volume 114A since the most recent synchronization to the replica volume 114B, in accordance with a specified RPO interval such as 5 minutes or any other suitable interval. For example, the source node 110A may take a snapshot of the production volume 114A at time T0, refresh the snapshot of the production volume 114A at time T0 plus 2.5 minutes (or “T0+2.5”), and perform a snapshot difference (or “snap diff”) operation to obtain snap diff information specifying the data changes at various offsets of the production volume 114A during the elapsed period between the taking of the snapshot at time T0 and the refreshing of the snapshot at time T0+2.5. In this example, during the remaining 2.5 minutes of the 5-minute RPO interval, a large replication data transfer operation can be performed to send the data changes from the source node 110A to the destination node 110B. The large replication data transfer operation includes reading the data changes at the various offsets of the production volume 114A, accumulating the data changes for a large replication data transfer to the replica volume 114B, and sending the large transfer of replication data from the source node 110A over the communication path 112 to the destination node 110B.
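
A minimal Python sketch of one such RPO cycle follows, with volumes and snapshots modeled as offset-to-data dictionaries; snap_diff here is only a stand-in for the storage system's snapshot-difference facility, and the function names are hypothetical.

    def snap_diff(old_snapshot: dict, new_snapshot: dict) -> dict:
        """Stand-in for the snapshot-difference facility: return the offsets whose
        data changed between two snapshots (volumes modeled as offset -> data)."""
        return {off: data for off, data in new_snapshot.items()
                if old_snapshot.get(off) != data}

    def replication_cycle(production_volume: dict, previous_snapshot: dict, send) -> dict:
        """One RPO cycle: refresh the snapshot, compute the changes since the previous
        snapshot, and ship them to the destination node as one large transfer."""
        current_snapshot = dict(production_volume)   # refresh the snapshot (e.g., at T0+2.5)
        changes = snap_diff(previous_snapshot, current_snapshot)
        if changes:
            send(changes)                            # large replication data transfer
        return current_snapshot                      # baseline for the next RPO interval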



FIG. 3a depicts an exemplary configuration of the data path processing components 216 of the storage node 110A/B (see FIG. 2). As described herein, the destination node 110B, as well as the source node 110A, can be configured like the storage node 110A/B of FIG. 2. As shown in FIG. 3a, the data path processing components 216 can be configured as a hardware/software stack that includes a frontend component (or “frontend”) 304, layered services 306, namespace components (or “namespace”) 308, a RAM cache 310, and a backend component (or “backend”) 312. Such a data path can be characterized as the path or flow of IO through the destination node 110B. For example, the data path may correspond to the logical flow through software and/or hardware components of the destination node 110B in connection with a user. In an asynchronous replication scenario, such a user can be an application running on the source node 110A that generates and sends the large transfer of replication data (e.g., a large transfer of replication data 302; see FIG. 3a) to the destination node 110B. In this example, upon receipt of the large transfer of replication data 302 at the destination node 110B, the frontend 304 translates it from a protocol-specific data transfer into a storage node-specific data transfer, and passes the large transfer of replication data 302 to the layered services 306.



FIG. 3b depicts an exemplary configuration of the layered services 306, which can be included in the data path processing components 216 of FIG. 3a. As shown in FIG. 3b, the layered services 306 can include a layered services orchestrator 314, which determines data path processing components to be dynamically included in the hardware/software stack of FIG. 3a. In this example, the data path processing components include an usher component (or “usher”) 316, a copier component (or “copier”) 318, a collator 320, and a transit component (or “transit”) 322. The usher 316 receives the large transfer of replication data 302 specifying the data changes from the frontend 304, and obtains a namespace object from the namespace 308 for use in writing the replication data to the replica volume 114B. The copier 318 reads the replication data from the namespace object and provides the replication data to the collator 320, which un-marshals, unpacks, and/or partitions the replication data to create or obtain a plurality of small write requests, each of which can have a size of 4 Kb or any other suitable size.



FIG. 4 depicts an exemplary small write request 402. It is noted that each small (e.g., 4 Kb) write request created or obtained by the storage node 110A/B of FIG. 2 can be configured like the small write request 402 of FIG. 4. In this example, the small write request 402 includes a header 404, a target location 406, and a payload 408. The header 404 of the small write request 402 includes one or more tags 410 indicating a target type 412 (e.g., “replica”), as well as optionally other information 414 (e.g., “hint(s)”, “production site identifier (ID)”, “retention period”). For example, such tag and/or hint information may be implemented as an enumerated (or “enum”) value, which may be added to a small write request at a collator level of layered services for data path processing. The target location 406 identifies a location of the replica volume 114B at the DR site. For example, the target location 406 may be expressed in terms of a logical unit number (LUN) and a logical address or offset location (e.g., a logical block address (LBA)). The payload 408 of the small write request 402 contains partitioned replication data to be written to the replica volume 114B at the target location 406. Once the small write request 402 has been tagged with the target type 412 (e.g., “replica”), the data contained in the payload 408 of the small write request 402 is written to a cache page in the RAM cache 310 (see FIG. 3a), and marked as being targeted to a “replica”, namely, the replica volume 114B (see FIG. 1). At a later point in time, the backend 312, via the device interfaces 208 (see FIG. 2), de-stages or flushes the data from the cache page to physical storage space (e.g., a PLB) for the replica volume 114B. It is noted that the transit 322 included in the data path processing components of FIG. 3b can be configured as an abstraction layer for supported protocols (e.g., SCSI, NFS, TCP, NVMe-oF) of the source and destination nodes 110A, 110B, which can use the transit 322 to communicate with each other over the communication path 112.
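
The layout of the small write request 402 might be modeled as follows; the field names, the enum values, and the choice of seconds for the retention-period hint are assumptions made for this Python sketch rather than the disclosed format.

    from dataclasses import dataclass
    from enum import Enum
    from typing import Optional

    class TargetType(Enum):        # the tag is described as an enumerated ("enum") value
        PRODUCTION = 0
        REPLICA = 1

    @dataclass
    class Header:                  # corresponds to the header 404 and tags 410
        target_type: TargetType = TargetType.REPLICA
        production_site_id: Optional[str] = None   # optional hint: originating production site
        retention_period_s: Optional[int] = None   # optional hint: retention period of the data

    @dataclass
    class TargetLocation:          # corresponds to the target location 406
        lun: int                   # logical unit number of the replica volume
        lba: int                   # logical block address (offset) within the volume

    @dataclass
    class SmallWriteRequest:       # corresponds to the small write request 402
        header: Header
        target: TargetLocation
        payload: bytes             # partitioned replication data to be written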


In certain asynchronous replication scenarios, the disclosed techniques can improve the performance and/or efficiency of the destination node 110B deployed at the DR site by leveraging host data being mostly written to, rather than read from, replica volumes stored at the DR site. In one scenario, each write operation performed by the destination node 110B in response to a tagged small write request (e.g., the small write request 402; see FIG. 4) can write data changes at specific offsets of the replica volume 114B to a cache page in the RAM cache 310 (see FIG. 3a). As described in this example, the data contained in the payload 408 of the small write request 402 and written to a cache page in the RAM cache 310 is marked as being targeted to a “replica” such as the replica volume 114B. As such, the destination node 110B flushes the marked data written to the cache page from the RAM cache 310 to the replica volume 114B, and, having flushed and invalidated the data, early evicts the cache page from the RAM cache 310 due to the low probability of the data subsequently being read. In this example, the destination node 110B either returns the clean cache page to a free page list of the RAM cache 310 or places the clean cache page at the head of a least recently used (LRU) list of the RAM cache 310. For example, the free page list may identify free or unused physical locations on an SSD or flash drive. It is noted that clean cache pages on the free page list may be reused at any time. Further, the age of a cache page placed on the LRU list may be determined based on the last time the cache page was accessed in connection with an IO operation. In this way, IO operations directed to production volumes (e.g., the production volume 116; see FIG. 1) stored at the mixed-use DR site can benefit from such cache pages being made free and available sooner for storing other cached data.


In another scenario, a large transfer of replication data to the replica volume 114B can include a large range or amount of contiguous host data, making the large transfer of replication data to the replica volume 114B a good candidate for inline deep compression. For example, such inline deep compression techniques performed on contiguous host data may provide a higher level of data compression (e.g., a higher compression ratio), thereby producing a more highly compressed version of the host data. In this example, because asynchronous replication is based on a recovery point objective (RPO), write requests issued by the host computers 102 to the production volume 114A are accumulated by the source node 110A over an RPO interval at the production site. Based on the locality of the host data specified in the write requests, the production volume 114A can have a large range or amount (e.g., 64 Kb, 128 Kb, 256 Kb) of logically contiguous data blocks written to it during the RPO interval. In this example, asynchronous replication is further based on snap diff technology, which can be used to determine differences between a first snapshot taken of the production volume 114A during the current RPO interval and a second snapshot taken of the production volume 114A during a prior RPO interval. The source node 110A can accumulate the differences between the first and second snapshots of the production volume 114A in a large transfer of replication data to the replica volume 114B, and send the large replication data transfer to the destination node 110B over the communication path 112. The destination node 110B can then perform inline deep compression on the contiguous host data contained in the large replication data transfer, and flush the highly compressed version of the host data to the replica volume 114B. Because host data is mostly written to, rather than read from, replica volumes stored at the DR site, any read penalty resulting from performing such inline deep compression on contiguous host data can be assumed to be low.
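
As a rough illustration of the inline deep compression strategy, the sketch below uses zlib's highest compression level as a stand-in for a deep compression engine, applying it only to contiguous runs above a hypothetical size threshold; the threshold and level values are assumptions, not parameters from the disclosure.

    import zlib

    DEEP_COMPRESSION_THRESHOLD = 64 * 1024   # hypothetical: only large contiguous runs qualify

    def compress_for_replica(contiguous_data: bytes,
                             fast_level: int = 1, deep_level: int = 9) -> bytes:
        """Compress replica-bound host data before it is flushed. Large contiguous
        runs get the slower, higher-ratio setting, since replica data is rarely read
        back; zlib levels stand in for the storage system's own compression engines."""
        level = deep_level if len(contiguous_data) >= DEEP_COMPRESSION_THRESHOLD else fast_level
        return zlib.compress(contiguous_data, level)

    # Example: a 256 Kb contiguous run is deep-compressed before being flushed.
    compressed = compress_for_replica(bytes(256 * 1024))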


In still another scenario, in addition to being tagged with the target type 412, “replica” (see FIG. 4), each small write request (e.g., the small write request 402; see FIG. 4) to the replica volume 114B can also be tagged to identify or indicate a production site where a corresponding source node is deployed. For example, the production site may be one of several production sites, each of which mandates a different class of service (CoS) for providing backup or remote storage of host data on its production volumes. In this example, upon receipt of a large transfer of replication data, the destination node 110B partitions it into a plurality of small write requests, tags each small write request with the target type, “replica”, and further tags the small write request with the other information 414 indicating the production site ID where the source node 110A is deployed. The destination node 110B then performs inline stream separation on the multi-tagged small write requests based on target type (“replica”) and the production site ID, and, for each resulting stream of small write request transactions, performs write operations to write data changes at specific offsets to the replica volume for subsequent storage to a storage tier (e.g., an SSD tier, a flash tier, an HDD tier, a cloud tier) that conforms to the CoS mandated by the identified production site.
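
A minimal sketch of the stream separation and CoS-based tier placement described above, with small write requests modeled as dictionaries and a hypothetical mapping from production site IDs to storage tiers; all names and the tier mapping are assumptions for illustration only.

    from collections import defaultdict

    # Hypothetical mapping from each production site's class of service to a storage tier.
    COS_TIER = {"site-A": "ssd", "site-B": "flash", "site-C": "hdd"}

    def separate_streams(small_write_requests):
        """Group replica-tagged write requests into one stream per production site.
        Requests are modeled as dicts with 'target_type', 'site_id', 'offset', 'data'."""
        streams = defaultdict(list)
        for req in small_write_requests:
            if req["target_type"] == "replica":
                streams[req["site_id"]].append(req)
        return streams

    def flush_streams(streams, write_to_tier):
        """Write each stream to the storage tier mandated by that site's CoS."""
        for site_id, requests in streams.items():
            tier = COS_TIER.get(site_id, "default")
            for req in requests:
                write_to_tier(tier, req["offset"], req["data"])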


In yet another scenario, each large transfer of replication data can be tagged with hint information pertaining to a retention period for host data to be written to a replica volume at the DR site. For example, the retention period may indicate a period of time (e.g., hour(s), day(s), week(s), year(s)) for which the host data may not be deleted. In this example, the source node 110A obtains changes made to host data at specific offsets of the production volume 114A since the most recent synchronization to the replica volume 114B, accumulates the data changes in a large transfer of replication data to the replica volume 114B, tags the large transfer of replication data with the hint information pertaining to the retention period for the host data, and sends the tagged large transfer of replication data to the destination node 110B. Upon receipt of the tagged large replication data transfer at the DR site, the destination node 110B partitions it into a plurality of small write requests (e.g., the write request 402; see FIG. 4), tags each small write request with the target type, “replica”, and further tags the small write request with the hint information pertaining to the retention period of the host data. The destination node 110B subsequently flushes the host data having the same retention period to a specific region of physical storage space (e.g., a 2 Mb PLB) for the replica volume 114B. Because host data having the same retention period can be flushed to the same PLB for a replica volume, subsequent deletion of the host data from the PLB at the expiration of the retention period can be performed more efficiently.
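
The retention-period grouping might be sketched as follows, again with write requests modeled as dictionaries; the allocate_plb and write_to_plb callables are hypothetical stand-ins for the backend's PLB allocation and flush paths.

    from collections import defaultdict

    PLB_SIZE = 2 * 1024 * 1024   # a physical large block provides 2 Mb of physical space

    def flush_by_retention(small_write_requests, allocate_plb, write_to_plb):
        """Bucket replica-bound writes by their retention-period hint and fill PLBs
        from one bucket at a time, so every PLB holds data that expires together.
        Requests are modeled as dicts with 'retention_period', 'offset', 'data' keys."""
        buckets = defaultdict(list)
        for req in small_write_requests:
            buckets[req["retention_period"]].append(req)
        for retention, requests in buckets.items():
            plb, used = allocate_plb(retention), 0
            for req in requests:
                if used + len(req["data"]) > PLB_SIZE:
                    plb, used = allocate_plb(retention), 0   # start a new PLB for this bucket
                write_to_plb(plb, req["offset"], req["data"])
                used += len(req["data"])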


A method of a data path strategy for improving the performance and/or efficiency of storage systems deployed at disaster recovery (DR) sites is described below with reference to FIG. 5. As depicted in block 502, a large transfer of replication data including accumulated changes made to data of a production volume since the most recent synchronization to a replica volume is received in an asynchronous replication process. As depicted in block 504, the large transfer of replication data is partitioned into a plurality of small write requests. As depicted in block 506, each small write request is tagged as a write request to the replica volume. As depicted in block 508, in response to servicing the plurality of small write requests, one or more of (i) early evicting, from cache memory, all cache pages used to cache host data specified in the plurality of small write requests, (ii) deep compression of contiguous host data specified in the plurality of small write requests, (iii) stream separation on the plurality of small write requests, each small write request being further tagged as corresponding to a specific production site, and (iv) flushing host data having the same retention period to a specific region of physical storage space for the replica volume, each small write request being further tagged with hint information pertaining to the retention period, are performed. In this way, the performance and/or efficiency of storage systems deployed at DR sites can be improved.
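
Tying the blocks of FIG. 5 together, a dispatch-style sketch might look like the following; the partition and strategy callables are supplied by the caller and are purely illustrative, not the disclosed implementation.

    def service_replication_transfer(large_transfer, partition, strategies):
        """Dispatch sketch of the method of FIG. 5: partition the large transfer
        (block 504), tag each small write request as replica-bound (block 506), and
        apply whichever of the four data path strategies are enabled (block 508)."""
        small_write_requests = partition(large_transfer)
        for req in small_write_requests:
            req["target_type"] = "replica"
        for strategy in strategies:           # e.g., early_evict, deep_compress,
            strategy(small_write_requests)    # stream_separate, flush_by_retention
        return small_write_requests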


Several definitions of terms are provided below for the purpose of aiding the understanding of the foregoing description, as well as the claims set forth herein.


As employed herein, the term “storage system” is intended to be broadly construed to encompass, for example, private or public cloud computing systems for storing data, as well as systems for storing data comprising virtual infrastructure and those not comprising virtual infrastructure.


As employed herein, the terms “client”, “host”, and “user” refer, interchangeably, to any person, system, or other entity that uses a storage system to read/write data.


As employed herein, the term “storage device” may refer to a storage array including multiple storage devices. Such a storage device may refer to any non-volatile memory (NVM) device, including hard disk drives (HDDs), solid state drives (SSDs), flash devices (e.g., NAND flash devices, NOR flash devices), and/or similar devices that may be accessed locally and/or remotely, such as via a storage area network (SAN).


As employed herein, the term “storage array” may refer to a storage system used for block-based, file-based, or other object-based storage. Such a storage array may include, for example, dedicated storage hardware containing HDDs, SSDs, and/or all-flash drives.


As employed herein, the term “storage entity” or “storage object” may refer to a filesystem, an object storage, a virtualized device, a logical unit (LUN), a logical volume (LV), a logical device, a physical device, and/or a storage medium.


As employed herein, the term “LUN” may refer to a logical entity provided by a storage system for accessing data from the storage system and may be used interchangeably with a logical volume (LV). The term “LUN” may also refer to a logical unit number for identifying a logical unit, a virtual disk, or a virtual LUN.


As employed herein, the term “physical storage unit” may refer to a physical entity such as a storage drive or disk or an array of storage drives or disks for storing data in storage locations accessible at addresses. The term “physical storage unit” may be used interchangeably with the term “physical volume”.


As employed herein, the term “storage medium” may refer to a hard drive or flash storage, a combination of hard drives and flash storage, a combination of hard drives, flash storage, and other storage drives or devices, or any other suitable types and/or combinations of computer readable storage media. Such a storage medium may include physical and logical storage media, multiple levels of virtual-to-physical mappings, and/or disk images. The term “storage medium” may also refer to a computer-readable program medium.


As employed herein, the term “IO request” or “IO” may refer to a data input or output request such as a read request or a write request.


As employed herein, the terms, “such as”, “for example”, “e.g.”, “exemplary”, and variants thereof refer to non-limiting embodiments and have meanings of serving as examples, instances, or illustrations. Any embodiments described herein using such phrases and/or variants are not necessarily to be construed as preferred or more advantageous over other embodiments, and/or to exclude incorporation of features from other embodiments.


As employed herein, the term “optionally” has a meaning that a feature, element, process, etc., may be provided in certain embodiments and may not be provided in certain other embodiments. Any particular embodiment of the present disclosure may include a plurality of optional features unless such features conflict with one another.


While various embodiments of the present disclosure have been particularly shown and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the present disclosure, as defined by the appended claims.

Claims
  • 1. A method of improving performance and efficiency of storage systems deployed at disaster recovery (DR) sites, comprising: receiving, at a destination node of a DR site from a source node at a production site, a large transfer of replication data including accumulated changes made to data of a production volume since a most recent synchronization to a replica volume in an asynchronous replication process; partitioning the large transfer of replication data into a plurality of small write requests; tagging each small write request with a first tag indicating a target type, the target type being a replica type of target for the small write request; and in response to servicing the plurality of small write requests, performing one or more of: early evicting, from cache memory, all cache pages used to cache host data specified in the plurality of small write requests, the early evicting being performed based on the target type being the replica type of target as indicated by the first tag; deep compression of a large amount of logically contiguous blocks in the host data specified in the plurality of small write requests, the deep compression being performed, by the destination node, at a high compression ratio before the large amount of logically contiguous blocks is flushed to the replica volume, the deep compression being performed based on the target type being the replica type of target as indicated by the first tag; stream separation on the plurality of small write requests, each small write request being further tagged with a second tag identifying a production site where the production volume is stored, the production site being a specific production site, the stream separation being performed based on the target type being the replica type of target as indicated by the first tag, and the production site being the specific production site as identified by the second tag; and flushing host data having an identical retention period to a specific region of physical storage space for the replica volume, each small write request being further tagged with a third tag containing hint information pertaining to the identical retention period of the host data, the flushing of the host data being performed based on the target type being the replica type of target as indicated by the first tag, and the hint information pertaining to the identical retention period of the host data as contained in the third tag.
  • 2. The method of claim 1 wherein the performing early evicting, from the cache memory, all cache pages used to cache host data specified in the plurality of small write requests includes writing the changes made to the data of the production volume to a cache page in the cache memory.
  • 3. The method of claim 2 wherein the performing early evicting, from the cache memory, all cache pages used to cache host data specified in the plurality of small write requests includes flushing the changes from the cache memory to the replica volume, and evicting the cache page from the cache memory in response to flushing the changes from the cache memory to the replica volume.
  • 4. The method of claim 3 wherein the performing early evicting, from the cache memory, all cache pages used to cache host data specified in the plurality of small write requests includes returning the cache page to a free page list of the cache memory.
  • 5. The method of claim 3 wherein the performing early evicting, from the cache memory, all cache pages used to cache host data specified in the plurality of small write requests includes placing the cache page at a head of a least recently used (LRU) list of the cache memory.
  • 6. The method of claim 1 wherein the performing deep compression of contiguous host data specified in the plurality of small write requests includes performing inline deep compression of the contiguous host data by the destination node.
  • 7. The method of claim 1 wherein the performing deep compression of contiguous host data specified in the plurality of small write requests includes writing a compressed version of the host data to a cache page in the cache memory.
  • 8. The method of claim 7 wherein the performing deep compression of contiguous host data specified in the plurality of small write requests includes flushing the compressed version of the host data from the cache memory to the replica volume.
  • 9. The method of claim 1 wherein the performing stream separation on the plurality of small write requests includes tagging each small write request with the second tag, the second tag including a production site identifier (ID) identifying the specific production site where the production volume is stored.
  • 10. (canceled)
  • 11. The method of claim 9 wherein the performing stream separation on the plurality of small write requests includes performing inline stream separation on the plurality of small write requests based at least on the production site ID included in the second tag.
  • 12. The method of claim 11 wherein the production site mandates a specific class of service (CoS) for providing backup or remote storage of host data, and wherein the performing stream separation on the plurality of small write requests includes, for each stream of small write request transactions resulting from performing the stream separation, performing write operations to write, to the replica volume, the changes made to the data of the production volume for subsequent storage to a storage tier that conforms to the specific CoS mandated by the production site.
  • 13. The method of claim 1 wherein the performing flushing host data having the identical retention period to the specific region of physical storage space for the replica volume includes tagging each small write request with the third tag containing the hint information pertaining to the identical retention period of the host data.
  • 14. (canceled)
  • 15. A system for improving performance and efficiency of storage systems deployed at disaster recovery (DR) sites, comprising: a memory; and processing circuitry configured to execute program instructions out of the memory to: receive, at a destination node of a DR site from a source node at a production site, a large transfer of replication data including accumulated changes made to data of a production volume since a most recent synchronization to a replica volume in an asynchronous replication process; partition the large transfer of replication data into a plurality of small write requests; tag each small write request with a first tag indicating a target type, the target type being a replica type of target for the small write request; and in response to servicing the plurality of small write requests, perform one or more of: early evicting, from cache memory, all cache pages used to cache host data specified in the plurality of small write requests, the early evicting being performed based on the target type being the replica type of target as indicated by the first tag; deep compression of a large amount of logically contiguous blocks in the host data specified in the plurality of small write requests, the deep compression being performed, by the destination node, at a high compression ratio before the large amount of logically contiguous blocks is flushed to the replica volume, the deep compression being performed based on the target type being the replica type of target as indicated by the first tag; stream separation on the plurality of small write requests, each small write request being further tagged with a second tag identifying a production site where the production volume is stored, the production site being a specific production site, the stream separation being performed based on the target type being the replica type of target as indicated by the first tag, and the production site being the specific production site as identified by the second tag; and flushing host data having an identical retention period to a specific region of physical storage space for the replica volume, each small write request being further tagged with a third tag containing hint information pertaining to the identical retention period of the host data, the flushing of the host data being performed based on the target type being the replica type of target as indicated by the first tag, and the hint information pertaining to the identical retention period of the host data as contained in the third tag.
  • 16. The system of claim 15 wherein the processing circuitry is configured to execute the program instructions out of the memory to: write the changes made to the data of the production volume to a cache page in the cache memory; and perform one of (i) returning the cache page to a free page list of the cache memory, and (ii) placing the cache page at a head of a least recently used (LRU) list of the cache memory.
  • 17. The system of claim 15 wherein the processing circuitry is configured to execute the program instructions out of the memory to perform inline deep compression of the contiguous host data by the destination node.
  • 18. The system of claim 15 wherein the processing circuitry is configured to execute the program instructions out of the memory to: tag each small write request with the second tag, the second tag including a production site identifier (ID) identifying the specific production site where the production volume is stored; and perform inline stream separation on the plurality of small write requests based at least on the production site ID included in the second tag.
  • 19. The system of claim 15 wherein the processing circuitry is configured to execute the program instructions out of the memory to: tag each small write request with the third tag containing the hint information pertaining to the identical retention period of the host data; and flush the host data having the identical retention period to a specific region of physical storage space for the replica volume.
  • 20. A computer program product including a set of non-transitory, computer-readable media having instructions that, when executed by processing circuitry, cause the processing circuitry to perform a method comprising: receiving, at a destination node of a DR site from a source node at a production site, a large transfer of replication data including accumulated changes made to data of a production volume since a most recent synchronization to a replica volume in an asynchronous replication process; partitioning the large transfer of replication data into a plurality of small write requests; tagging each small write request with a first tag indicating a target type, the target type being a replica type of target for the small write request; and in response to servicing the plurality of small write requests, performing one or more of: early evicting, from cache memory, all cache pages used to cache host data specified in the plurality of small write requests, the early evicting being performed based on the target type being the replica type of target as indicated by the first tag; deep compression of a large amount of logically contiguous blocks in the host data specified in the plurality of small write requests, the deep compression being performed, by the destination node, at a high compression ratio before the large amount of logically contiguous blocks is flushed to the replica volume, the deep compression being performed based on the target type being the replica type of target as indicated by the first tag; stream separation on the plurality of small write requests, each small write request being further tagged with a second tag identifying a production site where the production volume is stored, the production site being a specific production site, the stream separation being performed based on the target type being the replica type of target as indicated by the first tag, and the production site being the specific production site as identified by the second tag; and flushing host data having an identical retention period to a specific region of physical storage space for the replica volume, each small write request being further tagged with a third tag containing hint information pertaining to the identical retention period of the host data, the flushing of the host data being performed based on the target type being the replica type of target as indicated by the first tag, and the hint information pertaining to the identical retention period of the host data as contained in the third tag.