This application claims priority to Russian Patent Application number 2016148857, filed Dec. 13, 2016, and entitled “EFFICIENT MIGRATION TO DISTRIBUTED STORAGE,” which is incorporated herein by reference in its entirety.
Distributed storage systems may provide a wide range of storage services, while achieving high scalability, availability, and serviceability. An example of a distributed storage system is Elastic Cloud Storage (ECS) from Dell EMC of Hopkinton, Mass. Distributed storage systems may employ erasure coding or another data protection scheme to protect against data loss.
Some distributed storage systems may provide data migration solutions to facilitate transferring data between two or more data storages (e.g., to move legacy data to a new storage system). Many existing storage systems use a “push” migration mode, wherein a host reads data from one storage system (referred to as “source storage”) and writes/pushes data to a second storage system (referred to as “target storage”) using standard user I/O facilities of the target storage.
It is appreciated herein that, in existing systems, data migration may be inefficient due to overhead from a data protection scheme. In particular, writing migration data to target storage using standard user I/O facilities may invoke intermediate data protection processing that may not be essential in the context of data migration.
In accordance with one aspect of the disclosure, a method comprises: determining a list of objects within source storage to migrate; generating a chunk layout for the objects to migrate; and for each unencoded chunk within the chunk layout: retrieving objects from source storage specified by the unencoded chunk within the chunk layout; generating data and coded fragments for the unencoded chunk using the retrieved objects; and storing the data and coded fragments to primary storage.
In some embodiments, determining the list of objects within source storage to migrate includes querying the source storage. In certain embodiments, querying the source storage includes retrieving a list of object sizes from source storage. In particular embodiments, generating the chunk layout includes placing related objects within contiguous chunk segments and/or within contiguous chunks. In many embodiments, generating data and coded fragments for each unencoded chunk within the chunk layout includes generating data and coded fragments in parallel across multiple nodes of a distributed storage system.
According to another aspect of the disclosure, a system comprises one or more processors; a volatile memory; and a non-volatile memory storing computer program code that, when executed on the one or more processors, causes execution across the one or more processors of a process operable to perform embodiments of the method described hereinabove.
According to yet another aspect of the disclosure, a computer program product is tangibly embodied in a non-transitory computer-readable medium, the computer-readable medium storing program instructions that are executable to perform embodiments of the method described hereinabove.
The foregoing features may be more fully understood from the following description of the drawings.
The drawings are not necessarily to scale, or inclusive of all elements of a system, emphasis instead generally being placed upon illustrating the concepts, structures, and techniques sought to be protected herein.
Before describing embodiments of the structures and techniques sought to be protected herein, some terms are explained. As used herein, the term “storage system” may encompass, for example, private or public cloud computing systems for storing data, as well as systems for storing data comprising virtual infrastructure and those not comprising virtual infrastructure. As used herein, the terms “client” and “user” may refer to any person, system, or other entity that uses a storage system to read/write data. The term “I/O request” or simply “I/O” may be used herein to refer to a request to read or write data.
As used herein, the term “storage device” may refer to any non-volatile memory (NVM) device, including hard disk drives (HDDs), flash devices (e.g., NAND flash devices), and next generation NVM devices, any of which can be accessed locally and/or remotely (e.g., via a storage area network (SAN)). The term “storage device” may also refer to a storage array comprising one or more storage devices.
The storage cluster 104 includes one or more storage nodes 106a . . . 106n (generally denoted 106). A storage node 106 may include one or more services 108 and one or more storage devices 110, as shown. In some embodiments, a storage node 106 may include the following services 108: an authentication service to authenticate requests from clients 102; storage API services to parse and interpret requests from clients 102; a storage chunk management service to facilitate storage chunk allocation/reclamation for different storage system needs and to monitor storage chunk health and usage; a storage server management service to manage available storage device capacity and to track storage device states; and a storage server service to interface with the storage devices 110.
In particular embodiments, the distributed storage system 100 may use erasure coding to protect against data loss. In certain embodiments, storage node services 108 may include a migration service to efficiently migrate data into a storage system 100 that utilizes erasure coding. In some embodiments, the migration service performs at least a portion of the processing described below in conjunction with FIG. 4.
In many embodiments, a storage node 106 may include a processor and a non-volatile memory storing computer program code that when executed on the processor causes the processor to execute processes operable to perform functions of the services 108.
Storage devices 110 comprise one or more physical and/or logical storage devices attached to a storage node 106. In certain embodiments, storage devices 110 may be provided as a storage array. In particular embodiments, storage devices 110 may be provided as VNX or Symmetrix VMAX, which are available from Dell EMC of Hopkinton, Mass.
In general operation, a client 102 may send an I/O request to the storage cluster 104. The request may be received by any available storage node 106. The receiving node 106 may process the request locally and/or may delegate request processing to one or more other nodes 106 (referred to herein as its “peer nodes”). In some embodiments, client data may be split into fixed-size pieces (referred to herein as “chunks”) for storage within the cluster 104. In some embodiments, padding can be added to a chunk to ensure that all chunks are of equal size. In certain embodiments, each chunk may be 128 MB in size.
In some embodiments, a distributed storage system 100 may be an Elastic Cloud Storage (ECS) system from Dell EMC of Hopkinton, Mass.
In certain embodiments, the data and coded fragments may be generated using techniques described in co-owned U.S. patent application Ser. No. 15/193,407, filed on Jun. 27, 2016 and entitled ERASURE CODING FOR ELASTIC CLOUD STORAGE, which is herein incorporated by reference in its entirety.
As shown in FIG. 2, a distributed storage system 200 may use erasure coding to protect against data loss. In the embodiment shown, the system 200 includes sixteen (16) storage nodes 201-216. Under an erasure coding scheme, m coded fragments C1, C2, . . . , Cm may be generated for the k data fragments of a chunk, and the resulting k+m fragments may be distributed across the nodes 201-216.
When a client writes data D to the cluster, the data D may be split into k equal-size data fragments D1, D2, . . . , Dk, with padding or other data complement being added as needed to ensure the data fragments are of equal size. In some embodiments, if a data fragment D1, D2, . . . , Dk is lost (e.g., due to a node failure, a storage device failure, or data corruption), the lost data fragment may be regenerated using the available data fragments D1, D2, . . . , Dk and the redundant information within the available coded fragments C1, C2, . . . , Cm. In certain embodiments, at least k unique available fragments—either data fragments or coded fragments—may be required to decode a lost data fragment. Thus, according to some embodiments, the system 200 can tolerate the loss of any m fragments.
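To make the fragment arithmetic concrete, the following is a minimal sketch in Python of splitting a chunk into k padded fragments and regenerating a lost one. It uses a single XOR parity fragment (i.e., m = 1) as a simplified stand-in for the k+m coding described above; the function names and fragment sizes are illustrative assumptions, not taken from the disclosure.

```python
# Minimal sketch: split data D into k padded fragments, compute one XOR
# parity fragment (a stand-in for the m coded fragments), and recover a
# single lost fragment. All names are illustrative.

def split_with_padding(data: bytes, k: int) -> list:
    """Split data D into k equal-size fragments, zero-padding the tail."""
    frag_size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(k * frag_size, b"\x00")
    return [padded[i * frag_size:(i + 1) * frag_size] for i in range(k)]

def xor_parity(fragments: list) -> bytes:
    """Compute one coded fragment as the XOR of all data fragments."""
    parity = bytearray(len(fragments[0]))
    for frag in fragments:
        for i, b in enumerate(frag):
            parity[i] ^= b
    return bytes(parity)

def recover_lost(fragments: list, parity: bytes) -> list:
    """Regenerate a single lost data fragment (marked None) from survivors."""
    lost = fragments.index(None)
    recovered = bytearray(parity)
    for j, frag in enumerate(fragments):
        if j != lost:
            for i, b in enumerate(frag):
                recovered[i] ^= b
    out = list(fragments)
    out[lost] = bytes(recovered)
    return out

data = b"example chunk payload"
frags = split_with_padding(data, k=4)
parity = xor_parity(frags)
frags[2] = None  # simulate a lost fragment (e.g., node failure)
assert b"".join(recover_lost(frags, parity)).rstrip(b"\x00") == data
```

With m coded fragments rather than one, a Reed-Solomon style code over a Galois field plays the role of the XOR parity, which is what allows recovery from the loss of any m fragments.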
In many embodiments, the storage system 200 uses a delayed erasure coding process, while providing “intermediate” data protection between the time the data is initially stored and the time erasure coding completes. In particular, the storage system 200 may store multiple copies of the data D before sending an acknowledgement to the client. In one embodiment, the system 200 stores three (3) copies of the data D before sending an acknowledgement. In some embodiments, each copy of the data D is stored on a separate node 201-216. After the client is acknowledged, a designated (or “owner”) node enqueues a local erasure coding task to perform erasure encoding on the data D as described above. After the erasure coding task completes, the copies of data D may be deleted.
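As one way to picture the delayed scheme, here is a hedged sketch of the write path, assuming hypothetical cluster, node, and replica APIs (pick_nodes, store_copy, enqueue, and so on); none of these names come from the disclosure, and the XOR parity again stands in for the full k+m coding.

```python
# Minimal sketch of the delayed erasure coding write path described above,
# assuming hypothetical cluster/node/replica APIs; all names are illustrative.
REPLICA_COUNT = 3  # copies stored before the client is acknowledged

def write_with_intermediate_protection(cluster, data: bytes):
    # Mirror the data on separate nodes, then acknowledge the client.
    replicas = [node.store_copy(data) for node in cluster.pick_nodes(REPLICA_COUNT)]
    cluster.ack_client()

    # The designated "owner" node encodes in the background; once the
    # erasure coding task completes, the mirrored copies can be deleted.
    def erasure_coding_task():
        fragments = split_with_padding(data, k=12)  # e.g., 12 data fragments
        cluster.store_fragments(fragments + [xor_parity(fragments)])
        for replica in replicas:
            replica.delete()

    cluster.owner_node.enqueue(erasure_coding_task)
```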
Referring to FIG. 3, a distributed storage system 300 (“target storage”) may include a migration service 304 and a chunk encoding service 306, according to some embodiments. The migration service 304 may be used to migrate data from a separate storage system 302 (“source storage”) into target storage 300.
The migration service 304 may be configured to perform “pull migration” against source storage 302. In particular, the migration service 304 reads data from source storage 302 and writes it to target storage 300 using one or more internal services of the target storage. It is appreciated herein that some or all of the “intermediate” data protection scheme described above may be unnecessary for migrated data because the data being migrated may already be protected by source storage 302 and because access to the new data within target storage 300 can be controlled.
In various embodiments, target storage 300 and/or source storage 302 may be provided as object stores, meaning that user data is stored in arbitrary-sized blobs referred to as “objects.” In some embodiments, the migration service 304 determines a list of objects to be migrated from source storage 302, generates a so-called “chunk layout” for the objects, and then requests that encoding be performed for one or more chunks defined by the chunk layout. In certain embodiments, the migration service sends encoding requests to a chunk encoding service (e.g., chunk encoding service 306 in FIG. 3).
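The chunk-layout step can be sketched as a simple sequential packer. The sketch below assumes objects are presented as (object_id, size, group) tuples already ordered so related objects are adjacent; CHUNK_SIZE, the Chunk type, and generate_chunk_layout are illustrative names, not part of the disclosure.

```python
# Minimal sketch of chunk-layout generation, assuming objects arrive as
# (object_id, size, group) tuples sorted so related objects are adjacent.
from dataclasses import dataclass, field

CHUNK_SIZE = 128 * 1024 * 1024  # 128 MB chunks, per the description above

@dataclass
class Chunk:
    segments: list = field(default_factory=list)  # (object_id, offset, length)
    used: int = 0

def generate_chunk_layout(objects):
    """Pack objects into chunks sequentially so related objects land in
    contiguous chunk segments and/or contiguous chunks."""
    chunks, current = [], Chunk()
    for object_id, size, _group in objects:
        remaining = size
        while remaining > 0:
            free = CHUNK_SIZE - current.used
            if free == 0:                # current chunk full; start another
                chunks.append(current)
                current = Chunk()
                free = CHUNK_SIZE
            take = min(remaining, free)  # objects may span chunk boundaries
            current.segments.append((object_id, size - remaining, take))
            current.used += take
            remaining -= take
    if current.segments:
        chunks.append(current)           # final, possibly padded, chunk
    return chunks
```

Because related objects are packed back-to-back, they naturally land in contiguous segments of the same chunk or in contiguous chunks, which is the layout property described above.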
In certain embodiments, a migration service (e.g., migration service 304 in FIG. 3) may be configured to perform at least a portion of the processing described below in conjunction with FIG. 4. Within the flow diagram of FIG. 4, processing blocks represent computer software instructions or groups of instructions, and decision blocks represent computer software instructions, or groups of instructions, that affect the execution of the computer software instructions represented by the processing blocks.
Alternatively, the processing and decision blocks may represent steps performed by functionally equivalent circuits such as a digital signal processor (DSP) circuit or an application specific integrated circuit (ASIC). The flow diagrams do not depict the syntax of any particular programming language but rather illustrate the functional information one of ordinary skill in the art requires to fabricate circuits or to generate computer software to perform the processing required of the particular apparatus. It should be noted that many routine program elements, such as initialization of loops and variables and the use of temporary variables, may be omitted for clarity. The particular sequence of blocks described is illustrative only and can be varied without departing from the spirit of the concepts, structures, and techniques sought to be protected herein. Thus, unless otherwise stated, the blocks described below are unordered, meaning that, when possible, the functions represented by the blocks can be performed in any convenient or desirable order. In some embodiments, the processing and decision blocks represent states and transitions, respectively, within a finite-state machine, which can be implemented in software and/or hardware.
Referring to FIG. 4, a process 400 may begin by determining a list of objects within source storage to migrate. In some embodiments, the process 400 determines the list of objects by querying the source storage, which may include retrieving a list of object sizes from source storage.

Referring again to FIG. 4, the process 400 may generate a chunk layout for the objects to migrate. In certain embodiments, generating the chunk layout includes placing related objects within contiguous chunk segments and/or within contiguous chunks.

Referring back to FIG. 4, for each unencoded chunk within the chunk layout, the process 400 may retrieve the objects specified by the unencoded chunk from source storage, generate data and coded fragments for the unencoded chunk using the retrieved objects, and store the data and coded fragments to primary storage.
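Putting the steps together, the following is a minimal sketch of the pull-migration loop, reusing the helpers sketched earlier and assuming hypothetical source and target APIs (list_objects, read_object, store_fragments); all names are illustrative.

```python
# Minimal sketch of the pull-migration flow (process 400), reusing
# generate_chunk_layout, split_with_padding, and xor_parity from the
# sketches above; the source/target methods are hypothetical.
K_FRAGMENTS = 12  # e.g., a 12 + 4 scheme; actual parameters may differ

def migrate(source, target):
    # Determine the list of objects to migrate (e.g., query source storage,
    # including object sizes).
    objects = source.list_objects()  # -> [(object_id, size, group), ...]

    # Generate a chunk layout that places related objects contiguously.
    layout = generate_chunk_layout(objects)

    # For each unencoded chunk: pull the referenced object ranges from
    # source storage, encode, and store fragments directly to primary
    # storage, bypassing the intermediate mirroring used for client writes.
    for chunk in layout:
        payload = b"".join(
            source.read_object(object_id, offset, length)
            for object_id, offset, length in chunk.segments
        )
        fragments = split_with_padding(payload, k=K_FRAGMENTS)
        coded = [xor_parity(fragments)]  # stand-in for the m coded fragments
        target.store_fragments(fragments + coded)
```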
In various embodiments, the process 400 may be executed in a distributed, parallel manner across multiple nodes of the target storage system. In some embodiments, each node may execute one or more control threads to perform the processing described hereinabove.
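As one possible illustration of the parallel variant, the sketch below spreads chunk encoding across a thread pool; in a real deployment the same partitioning could span control threads on multiple nodes, but the scheduling shown here is an assumption, not the system's actual design.

```python
# Minimal sketch of encoding chunks in parallel, using Python's standard
# concurrent.futures and the helpers from the sketches above.
from concurrent.futures import ThreadPoolExecutor

def migrate_parallel(source, target, layout, workers=8):
    def encode_chunk(chunk):
        payload = b"".join(
            source.read_object(oid, off, length)
            for oid, off, length in chunk.segments
        )
        fragments = split_with_padding(payload, k=K_FRAGMENTS)
        target.store_fragments(fragments + [xor_parity(fragments)])

    # Each worker pulls and encodes independent chunks; in the distributed
    # case, the same division of labor could span nodes rather than threads.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(encode_chunk, layout))
```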
In some embodiments, the so-called “pull migration” scheme described herein can reduce network and/or I/O traffic within a target storage system. In certain embodiments, migrated objects may have greater spatial locality within the target storage, reducing the amount of system metadata required to track the objects and increasing object read performance. In various embodiments, pull migration can improve garbage collection performance within target storage because related objects will tend to be stored in the same or contiguous chunks.
Referring to FIG. 5, in some embodiments, a non-transitory computer-readable medium 520 may be provided on which a computer program product may be tangibly embodied. The non-transitory computer-readable medium 520 may store program instructions that are executable to perform the processing described above in conjunction with FIG. 4.
Referring again to FIG. 5, processing may be implemented in hardware, software, or a combination of the two.
The system can perform processing, at least in part, via a computer program product (e.g., in a machine-readable storage device), for execution by, or to control the operation of, data processing apparatus (e.g., a programmable processor, a computer, or multiple computers). Each such program may be implemented in a high-level procedural or object-oriented programming language to communicate with a computer system. However, the programs may be implemented in assembly or machine language. The language may be a compiled or an interpreted language, and it may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network. A computer program may be stored on a storage medium or device (e.g., CD-ROM, hard disk, or magnetic diskette) that is readable by a general or special purpose programmable computer for configuring and operating the computer when the storage medium or device is read by the computer. Processing may also be implemented as a machine-readable storage medium, configured with a computer program, where, upon execution, instructions in the computer program cause the computer to operate. The program logic may be run on one or more physical or virtual processors.
Processing may be performed by one or more programmable processors executing one or more computer programs to perform the functions of the system. All or part of the system may be implemented as special purpose logic circuitry (e.g., an FPGA (field programmable gate array) and/or an ASIC (application-specific integrated circuit)).
All references cited herein are hereby incorporated herein by reference in their entirety.
Having described certain embodiments, which serve to illustrate various concepts, structures, and techniques sought to be protected herein, it will be apparent to those of ordinary skill in the art that other embodiments incorporating these concepts, structures, and techniques may be used. Elements of different embodiments described hereinabove may be combined to form other embodiments not specifically set forth above and, further, elements described in the context of a single embodiment may be provided separately or in any suitable sub-combination. Accordingly, it is submitted that the scope of protection sought herein should not be limited to the described embodiments but rather should be limited only by the spirit and scope of the following claims.
Foreign Application Priority Data:

| Number | Date | Country | Kind |
| --- | --- | --- | --- |
| 2016148857 | Dec. 2016 | RU | national |
Other Publications:

U.S. Appl. No. 15/620,892, filed Jun. 13, 2017, Danilov et al.
U.S. Appl. No. 15/620,898, filed Jun. 13, 2017, Danilov et al.
U.S. Appl. No. 15/620,900, filed Jun. 13, 2017, Danilov et al.
U.S. Appl. No. 15/193,144, filed Jun. 27, 2016, Kurilov et al.
U.S. Appl. No. 15/193,141, filed Jun. 27, 2016, Danilov et al.
U.S. Appl. No. 15/186,576, filed Jun. 20, 2016, Malygin et al.
U.S. Appl. No. 15/193,145, filed Jun. 27, 2016, Fomin et al.
U.S. Appl. No. 15/193,407, filed Jun. 27, 2016, Danilov et al.
U.S. Appl. No. 15/193,142, filed Jun. 27, 2016, Danilov et al.
U.S. Appl. No. 15/193,409, filed Jun. 27, 2016, Trusov et al.
U.S. Appl. No. 15/281,172, filed Sep. 30, 2016, Trusov et al.
U.S. Appl. No. 15/398,832, filed Jan. 5, 2017, Danilov et al.
U.S. Appl. No. 15/398,826, filed Jan. 5, 2017, Danilov et al.
U.S. Appl. No. 15/398,819, filed Jan. 5, 2017, Danilov et al.
Office Action dated Nov. 27, 2017 from U.S. Appl. No. 15/186,576; 11 pages.
Office Action dated Dec. 14, 2017 from U.S. Appl. No. 15/281,172; 9 pages.
Response to Office Action dated Sep. 15, 2017 from U.S. Appl. No. 15/193,409, filed Dec. 14, 2017; 11 pages.
Response to Office Action dated Oct. 5, 2017 from U.S. Appl. No. 15/193,407, filed Dec. 20, 2017; 12 pages.
Response to Office Action dated Oct. 18, 2017 from U.S. Appl. No. 15/193,145, filed Jan. 17, 2018; 12 pages.
U.S. Non-Final Office Action dated Feb. 2, 2018 for U.S. Appl. No. 15/398,826; 16 pages.
Final Office Action dated Jun. 19, 2018 for U.S. Appl. No. 15/398,826; 8 pages.
Response to U.S. Non-Final Office Action dated Dec. 14, 2017 for U.S. Appl. No. 15/281,172; response filed Apr. 9, 2018; 12 pages.
Anvin, "The mathematics of RAID-6;" Zytor; Dec. 20, 2011; 9 pages.
Blomer et al., "An XOR-Based Erasure-Resilient Coding Scheme;" International Computer Science Institute, Berkeley, California; 1995; 19 pages.
Response to U.S. Non-Final Office Action dated Nov. 27, 2017 for U.S. Appl. No. 15/186,576; response filed Feb. 23, 2018; 7 pages.
U.S. Final Office Action dated Mar. 1, 2018 for U.S. Appl. No. 15/193,145; 32 pages.
U.S. Final Office Action dated Mar. 2, 2018 for U.S. Appl. No. 15/193,409; 10 pages.
U.S. Non-Final Office Action dated Jun. 18, 2018 for U.S. Appl. No. 15/398,819; 8 pages.
U.S. Non-Final Office Action dated Oct. 5, 2017 for U.S. Appl. No. 15/193,407; 14 pages.
U.S. Non-Final Office Action dated Oct. 18, 2017 for U.S. Appl. No. 15/193,145; 21 pages.
U.S. Non-Final Office Action dated Sep. 15, 2017 for U.S. Appl. No. 15/193,409; 12 pages.
Notice of Allowance dated Jul. 13, 2018 for U.S. Appl. No. 15/281,172; 13 pages.
Response to Non-Final Office Action dated Jun. 18, 2018 for U.S. Appl. No. 15/398,819; response filed Sep. 17, 2018; 10 pages.
Notice of Allowance dated Oct. 16, 2018 for U.S. Appl. No. 15/398,826; 10 pages.
RCE and Response to Final Office Action dated Jun. 19, 2018 for U.S. Appl. No. 15/398,826, filed Aug. 23, 2018; 11 pages.
Publication Data:

| Number | Date | Country |
| --- | --- | --- |
| 20180165034 A1 | Jun. 2018 | US |