Content-based storage (sometimes referred to as content-addressable storage or CAS) stores data based on its content, providing inherent data deduplication and facilitating in-line data compression, among other benefits. Some existing content-based storage systems may provide data backup and replication capabilities. For example, snapshots of a given storage volume may be made at arbitrary points in time and replicated to a remote system (e.g., another content-based storage system). Consecutive snapshots may be compared to identify which data in the volume changed and, thus, needs to be transmitted to the remote system. Between any two consecutive snapshots, the storage system may process an arbitrary number of I/O writes for the storage volume.
Some storage systems allow a so-called “recovery point objective” (RPO) period to be defined by a storage administrator or other user. An RPO period may specify the maximum targeted time period in which data might be lost (e.g., due to corruption or disk failure). Existing storage systems may automatically generate and replicate snapshots at some frequency determined by the RPO.
It is appreciated herein that the amount of data that needs to be transmitted during each replication cycle is generally unknown, whereas the targeted maximum length of a replication cycle may be user-defined. For example, a user that defines a 30-second RPO period expects all of the changed data to be transmitted within that window. It is further appreciated that transmitting data from a storage system may consume system resources (e.g., network bandwidth and processing cycles) shared by other processes, including those that process I/O reads and writes. System performance may be improved by throttling replication data transmissions using a technique referred to herein as “link smoothing.”
According to one aspect of the disclosure, a method comprises: determining one or more slices of a logical address space assigned to a replication processor; determining an elapsed time since a start of a replication cycle; determining an expected number of slices that should have been replicated based on the elapsed time; and replicating one or more slices of the logical address space in response to determining that the actual number of slices replicated by the replication processor within the replication cycle is less than the expected number of slices.
In some embodiments, determining an expected number of slices that should have been replicated is further based on a recovery point objective (RPO) and the number of slices of the logical address space assigned to the replication processor. In certain embodiments, determining an expected number of slices that should have been replicated is further based upon an acceleration factor. In particular embodiments, the method further comprises assigning each slice of a logical address space to one of a plurality of replication processors.
According to another aspect of the disclosure, a system comprises one or more processors; a volatile memory; and a non-volatile memory storing computer program code that, when executed across the one or more processors, causes execution of a process operable to perform embodiments of the method described hereinabove.
According to yet another aspect of the disclosure, a computer program product is tangibly embodied in a non-transitory computer-readable medium, the computer-readable medium storing program instructions that are executable to perform embodiments of the method described hereinabove.
The foregoing features may be more fully understood from the following description of the drawings in which:
The drawings are not necessarily to scale, or inclusive of all elements of a system, emphasis instead generally being placed upon illustrating the concepts, structures, and techniques sought to be protected herein.
Before describing embodiments of the structures and techniques sought to be protected herein, some terms are explained. As used herein, the term “storage system” may be broadly construed so as to encompass, for example, private or public cloud computing systems for storing data as well as systems for storing data comprising virtual infrastructure and those not comprising virtual infrastructure. As used herein, the terms “host,” “client,” and “user” may refer to any person, system, or other entity that uses a storage system to read/write data.
As used herein, the terms “disk” and “storage device” may refer to any non-volatile memory (NVM) device, including hard disk drives (HDDs), flash devices (e.g., NAND flash devices), and next-generation NVM devices, any of which can be accessed locally and/or remotely (e.g., via a storage area network (SAN)). The term “storage array” may be used herein to refer to any collection of storage devices. As used herein, the term “memory” may refer to volatile memory used by the storage system, such as dynamic random access memory (DRAM).
As used herein, the terms “I/O read request” and “I/O read” refer to a request to read data. The terms “I/O write request” and “I/O write” refer to a request to write data. The terms “I/O request” and “I/O” refer to a request that may be either an I/O read request or an I/O write request. As used herein, the terms “logical I/O address” and “I/O address” refer to a logical address used by hosts to read/write data from/to a storage system.
While vendor-specific terminology may be used herein to facilitate understanding, it is understood that the concepts, techniques, and structures sought to be protected herein are not limited to use with any specific commercial products.
In the embodiment shown, the subsystems 102 include a routing subsystem 102a, a control subsystem 102b, a data subsystem 102c, and a replication subsystem 102d. In one embodiment, the subsystems 102 may be provided as software modules, i.e., computer program code that, when executed on a processor, may cause a computer to perform functionality described herein. In a certain embodiment, the storage system 100 includes an operating system (OS), and one or more of the subsystems 102 may be provided as user space processes executable by the OS. In other embodiments, a subsystem 102 may be provided, at least in part, as hardware, such as a digital signal processor (DSP) or an application-specific integrated circuit (ASIC) configured to perform functionality described herein.
The routing subsystem 102a may be configured to receive I/O requests from clients 116 and to translate client requests into internal commands. Each I/O request may be associated with a particular volume and may include one or more I/O addresses (i.e., logical addresses within that volume). The storage system 100 stores data in fixed-size chunks, for example 4 KB chunks, where each chunk is uniquely identified within the system using a “hash” value that is derived from the data/content stored within the chunk. The routing subsystem 102a may be configured to convert an I/O request for an arbitrary amount of data into one or more internal I/O requests each for a chunk-sized amount of data. The internal I/O requests may be sent to one or more available control subsystems 102b for processing. In some embodiments, the routing subsystem 102a is configured to receive Small Computer System Interface (SCSI) commands from clients. In certain embodiments, I/O requests may include one or more logical block addresses (LBAs).
For example, if a client 116 sends a request to write 8 KB of data starting at logical address zero (0), the routing subsystem 102a may split the data into two 4 KB chunks, generate a first internal I/O request to write 4 KB of data to logical address zero (0), and generate a second internal I/O request to write 4 KB of data to logical address one (1). The routing subsystem 102a may calculate hash values for each chunk of data to be written, and send the hashes to the control subsystem(s) 102b. In one embodiment, chunk hashes are calculated using a Secure Hash Algorithm 1 (SHA-1).
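As a rough illustration of the chunking and hashing described above, the sketch below splits a chunk-aligned write into 4 KB internal requests and derives a SHA-1 hash for each chunk; the function name, the tuple layout, and the alignment assumption are illustrative choices, not the storage system's actual interfaces.

```python
import hashlib

CHUNK_SIZE = 4 * 1024  # fixed-size chunks, per the description above

def split_write(start_address, data):
    """Split a client write into chunk-sized internal I/O requests.

    Returns a list of (logical_address, chunk_hash, chunk_data) tuples,
    one per 4 KB chunk. Assumes the write is chunk-aligned for simplicity.
    """
    requests = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        chunk_hash = hashlib.sha1(chunk).hexdigest()  # hash derived from chunk content
        requests.append((start_address + offset // CHUNK_SIZE, chunk_hash, chunk))
    return requests

# Example: an 8 KB write at logical address 0 yields two internal requests,
# one for address 0 and one for address 1, as in the example above.
internal_ios = split_write(0, b"\x00" * 8192)
```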
As another example, if a client 116 sends a request to read 8 KB of data starting at logical address one (1), the routing subsystem 102a may generate a first internal I/O request to read 4 KB of data from address one (1) and a second internal I/O request to read 4 KB of data from address two (2).
The control subsystem 102b may also be configured to clone storage volumes and to generate snapshots of storage volumes using techniques known in the art. For each volume/snapshot, the control subsystem 102b may maintain a so-called “address-to-hash” (A2H) table 112 that maps I/O addresses to hash values of the data stored at those logical addresses.
The data subsystem 102c may be configured to maintain one or more so-called “hash-to-physical address” (H2P) tables 114 that map chunk hash values to physical storage addresses (e.g., storage locations within the storage array 106 and/or within individual disks 108). Using the H2P tables 114, the data subsystem 102c handles reading/writing chunk data from/to the storage array 106. The H2P table may also include per-chunk metadata such as a compression ratio and a reference count. A chunk compression ratio indicates the size of the compressed chunk stored on disk relative to the uncompressed chunk size. For example, a compression ratio of 0.25 may indicate that the compressed chunk on disk occupies 25% of the space required by the uncompressed chunk. A chunk reference count may indicate the number of times that the chunk's hash appears within the A2H tables. For example, if the same chunk data is stored at two different logical addresses within the same volume/snapshot (or within two different volumes/snapshots), the H2P table may indicate that the chunk has a reference count of two (2).
It will be appreciated that combinations of the A2H 112 and H2P 114 tables can provide multiple levels of indirection between the logical (or “I/O”) address a client 116 uses to access data and the physical address where that data is stored. Among other advantages, this may give the storage system 100 freedom to move data within the storage array 106 without affecting a client's 116 access to that data (e.g., if a disk 108 fails). In some embodiments, an A2H 112 table and/or an H2P 114 table may be stored in memory.
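The two-table indirection can be illustrated with a minimal sketch that uses dictionaries as stand-ins for the A2H and H2P tables; the field names and the write/read helpers are assumptions made for illustration and do not reflect the system's actual data layout.

```python
# Hypothetical in-memory stand-ins for the A2H and H2P tables.
a2h = {}   # logical address -> chunk hash (one such table per volume/snapshot)
h2p = {}   # chunk hash -> {"phys": physical address, "ratio": ..., "refs": ...}

def record_write(addr, chunk_hash, phys_addr, ratio):
    """Record a chunk write, deduplicating on the content hash."""
    old = a2h.get(addr)
    if old is not None:
        h2p[old]["refs"] -= 1            # address no longer refers to the old chunk
    entry = h2p.setdefault(chunk_hash, {"phys": phys_addr, "ratio": ratio, "refs": 0})
    entry["refs"] += 1                   # reference count tracks A2H occurrences
    a2h[addr] = chunk_hash

def physical_location(addr):
    """Resolve a logical address to a physical location via both tables."""
    chunk_hash = a2h[addr]               # A2H lookup
    return h2p[chunk_hash]["phys"]       # H2P lookup
```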
The replication subsystem 102d may be configured to replicate data from the storage system 100 to a remote system (e.g., another storage system). In some embodiments, the replication subsystem 102d may automatically replicate one or more storage volumes based on defined RPO periods. Within a replication cycle, the replication subsystem 102d may cause a volume snapshot to be generated, determine which data has changed within the volume since the previous replication cycle (e.g., by comparing consecutive snapshots), and transmit the changed data to the remote system. In various embodiments, a replication subsystem 102d may be included within a control subsystem 102b.
In some embodiments, storage system 100 divides the address space of one or more storage volumes (e.g., LUNs) into a plurality of smaller logical address spaces referred to herein as “slices.” Each slice may be assigned to one of a plurality of control subsystems 102b and/or replication subsystems 102d to balance load within the storage system 100.
In various embodiments, a replication subsystem 102d may perform link smoothing to reduce network/processing load within the system 100 during a replication cycle. In certain embodiments, the replication subsystem 102d implements at least a portion of the processing described below.
In some embodiments, storage system 100 corresponds to a node within a distributed storage system having a plurality of nodes, each of which may include one or more of the subsystems 102a-102d.
In one embodiment, the system 100 includes features used in EMC® XTREMIO®.
A snapshot of a storage volume may be generated by making a copy of the volume's A2H table. The copied table represents the contents of the volume at a particular point in time and, for example, can be used to revert the state of the volume to that point in time. Two A2H tables may be compared to determine which chunks of data within a volume were written (i.e., modified or added) between the respective points in time.
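A minimal sketch of the comparison described above, assuming each A2H table is represented as a mapping from logical address to chunk hash; the helper name is hypothetical.

```python
def changed_addresses(a2h_prev, a2h_curr):
    """Return logical addresses written (modified or added) between two snapshots.

    Each argument maps logical address -> chunk hash, as in the A2H tables
    described above.
    """
    changed = []
    for addr, chunk_hash in a2h_curr.items():
        if a2h_prev.get(addr) != chunk_hash:   # new address or new content
            changed.append(addr)
    return changed

# Only the chunks at the returned addresses need to be sent to the remote system.
```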
In various embodiments, each slice may be assigned to a particular control processor (e.g., an instance of the control subsystem 102b described above).
In some embodiments, each slice may be assigned to a particular replication processor (e.g., an instance of the replication subsystem 102d described above).
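One plausible way to derive a slice from a logical address is to hash the address and reduce it modulo the slice count, so that the slices uniformly sample the address space; the sketch below assumes 1024 slices and a CRC-based slicing function, neither of which is mandated by the description above.

```python
import zlib

NUM_SLICES = 1024  # example slice count used in the description

def slice_of(volume_id, logical_address):
    """Map a (volume, logical address) pair to one of NUM_SLICES slices.

    A simple, deterministic hash-and-modulo scheme chosen only for
    illustration; the real slicing function is not specified here.
    """
    key = f"{volume_id}:{logical_address}".encode()
    return zlib.crc32(key) % NUM_SLICES
```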
It will be appreciated herein that I/O load and/or replication load may be balanced across multiple processors using the above-described techniques. In general, I/O writes may not be uniformly distributed over a logical address space. For example, certain volumes may receive more I/O writes than others, and certain regions within a given volume's logical address space may receive a disproportionate number of I/O writes. Choosing and assigning slices that uniformly sample an address space can be an effective means to distribute I/O load within a storage system. For example, multiple replication processors can each scan the distinct slices assigned thereto, identifying and transmitting data that has changed. When all processors have completed their scans, the entire address space (e.g., all 1024 slices) will have been replicated. In some embodiments, a replication processor may throttle data transmissions on a per-slice basis, as described below.
At block 402, one or more slices of a logical address space that are assigned to the replication processor are determined. In some embodiments, the assignment of slices to processors may be configured by a user (e.g., within a configuration file). In particular embodiments, a user specifies a number of replication processors to use, and the storage system automatically assigns slices to the replication processors. For example, if 16 replication processors are specified and 1024 slices are used, then the storage system may assign 64 slices to each replication processor.
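A minimal sketch of the even slice assignment described above, assuming a simple round-robin policy; the description only requires that, e.g., 16 processors each receive 64 of 1024 slices, so the policy and function name are illustrative.

```python
def assign_slices(num_slices=1024, num_processors=16):
    """Assign each slice index to a replication processor, round-robin.

    With 1024 slices and 16 processors, each processor receives 64 slices.
    """
    assignments = {p: [] for p in range(num_processors)}
    for s in range(num_slices):
        assignments[s % num_processors].append(s)
    return assignments
```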
At block 404, an amount of time elapsed since the start of the replication cycle is determined. At block 406, the maximum (or “expected”) number of slices that should have been replicated based on the replication cycle elapsed time is determined. In some embodiments, the expected number of slices (slices_exp) may be calculated as slices_exp = S × (t/T),
where t is the elapsed time, T is a desired cycle time (e.g., based on a user-defined RPO), and S is the number of slices assigned to the replication processor.
In some embodiments, a so-called “acceleration factor” (F) may be used to allow the replication processor to complete ahead of schedule (in the nominal case), so as to increase the probability that the RPO is achieved (e.g., even if a system slowdown occurs). In such embodiments, the expected number of slices (slices_exp) may be calculated as slices_exp = F × S × (t/T).
In one embodiment, F=1.1 (i.e., the acceleration factor may be selected to provide a 10% margin for the replication cycle).
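For illustration, the formula above can be evaluated as follows; the rounding and capping choices shown are assumptions rather than requirements.

```python
def expected_slices(elapsed, cycle_time, assigned_slices, accel=1.1):
    """Number of slices that should have been replicated after `elapsed` seconds.

    Computes slices_exp = F * S * (t / T), capped at the assigned slice count.
    """
    return min(assigned_slices, int(accel * assigned_slices * elapsed / cycle_time))

# Example: 10 s into a 30 s RPO cycle with 64 assigned slices and F = 1.1,
# roughly 23 slices should already have been replicated.
print(expected_slices(10, 30, 64))  # -> 23
```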
At block 408, if the number of slices replicated by the replication processor within the current replication cycle is less than the expected number, then one or more additional slices may be replicated (block 412). For example, the number of slices replicated at block 412 may be equal to the expected number minus the actual number of slices replicated by the replication processor within the replication cycle. Otherwise, the replication processor may wait (or “sleep”, block 410) for some amount of time before repeating conditional block 408.
At block 414, if all slices assigned to the replication processor have been replicated, then the process may end. Otherwise, the method may continue from block 404.
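Putting blocks 402 through 414 together, a minimal sketch of the link-smoothing loop for a single replication processor might look like the following, assuming a caller-supplied replicate_slice function and a hypothetical polling interval.

```python
import time

def smooth_replicate(slices, cycle_time, replicate_slice, accel=1.1, poll=0.5):
    """Replicate `slices` over roughly `cycle_time` seconds (link smoothing).

    `replicate_slice` is a caller-supplied function that scans one slice and
    transmits its changed data; `poll` is how long to sleep when ahead of schedule.
    """
    start = time.monotonic()
    done = 0
    while done < len(slices):                       # block 414: all slices replicated?
        elapsed = time.monotonic() - start          # block 404: elapsed cycle time
        expected = min(len(slices),                 # block 406: expected slice count
                       int(accel * len(slices) * elapsed / cycle_time))
        if done < expected:                         # block 408: behind schedule?
            for s in slices[done:expected]:         # block 412: replicate enough slices
                replicate_slice(s)                  #            to catch up to schedule
            done = expected
        else:
            time.sleep(poll)                        # block 410: ahead of schedule, wait
```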
In the embodiment shown, computer instructions 512 may include routing subsystem instructions 512a that may correspond to an implementation of the routing subsystem 102a described above.
Processing may be implemented in hardware, software, or a combination of the two. In various embodiments, processing is provided by computer programs executing on programmable computers/machines that each includes a processor, a storage medium or other article of manufacture that is readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and one or more output devices. Program code may be applied to data entered using an input device to perform processing and to generate output information.
The system can perform processing, at least in part, via a computer program product (e.g., in a machine-readable storage device), for execution by, or to control the operation of, data processing apparatus (e.g., a programmable processor, a computer, or multiple computers). Each such program may be implemented in a high-level procedural or object-oriented programming language to communicate with a computer system. However, the programs may be implemented in assembly or machine language. The language may be a compiled or an interpreted language and it may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network. A computer program may be stored on a storage medium or device (e.g., CD-ROM, hard disk, or magnetic diskette) that is readable by a general or special purpose programmable computer for configuring and operating the computer when the storage medium or device is read by the computer. Processing may also be implemented as a machine-readable storage medium, configured with a computer program, where, upon execution, instructions in the computer program cause the computer to operate.
Processing may be performed by one or more programmable processors executing one or more computer programs to perform the functions of the system. All or part of the system may be implemented as special purpose logic circuitry (e.g., an FPGA (field programmable gate array) and/or an ASIC (application-specific integrated circuit)).
All references cited herein are hereby incorporated herein by reference in their entirety.
Having described certain embodiments, which serve to illustrate various concepts, structures, and techniques sought to be protected herein, it will be apparent to those of ordinary skill in the art that other embodiments incorporating these concepts, structures, and techniques may be used. Elements of different embodiments described hereinabove may be combined to form other embodiments not specifically set forth above and, further, elements described in the context of a single embodiment may be provided separately or in any suitable sub-combination. Accordingly, it is submitted that the scope of protection sought herein should not be limited to the described embodiments but rather should be limited only by the spirit and scope of the following claims.