Distributed storage systems may include a cluster of nodes, each capable of processing I/O requests and/or storing data. A node that receives an I/O request may be different from the node on which the requested data is stored. Data corruption can occur within a storage device (e.g., disk- or flash-based storage) and also on the communication lines between nodes.
Many distributed storage systems store data in fixed-size blocks (e.g., 8 KB blocks). Some distributed storage systems support “small” reads and writes, meaning I/O requests to read/write data that is smaller than a full block.
In some embodiments, systems and methods provide end-to-end data protection in a distributed storage system, while reducing the computation and bandwidth required for small reads and writes, as compared to existing systems and methods. In certain embodiments, a system/method does not require a full block of data to be transmitted between nodes if only a small portion of the block is being read. In many embodiments, a system/method does not require re-computing a hash over a full block if only a small portion of the block is updated.
According to an aspect of the disclosure, a method comprises: receiving, at a first node of a distributed storage system, an I/O request to write a block; splitting the block into a plurality of sub blocks; generating a sub block error detection hash for each of the sub blocks; sending the block and the sub block error detection hashes to a second node within the distributed storage system; and storing the block and the sub block error detection hashes to a storage device at the second node.
In some embodiments, generating a sub block error detection hash comprises calculating a hash of a sub block. In another embodiment, a method further comprises calculating a block error detection hash using the sub block error detection hashes, and adding the block error detection hash to metadata within the distributed storage system.
In one embodiment, calculating a block error detection hash using the sub block error detection hashes comprises: concatenating the sub block error detection hashes together, and calculating a hash of the concatenated sub block error detection hashes. In another embodiment, generating a sub block error detection hash for each of the sub blocks comprises generating two or more of the sub block error detection hashes in parallel.
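As a minimal illustration of this write path, consider the following Python sketch. It assumes SHA-1 for both code generation functions and a 1 KB sub block size; the disclosure does not mandate a particular hash function or sub block size, so both choices here are assumptions.

```python
import hashlib

SUB_BLOCK_SIZE = 1024  # assumed sub block size; not fixed by the disclosure


def f1(sub_block: bytes) -> bytes:
    """First code generation function F1: hash of a single sub block."""
    return hashlib.sha1(sub_block).digest()


def f2(concatenated_hashes: bytes) -> bytes:
    """Second code generation function F2: hash over the concatenated sub block hashes."""
    return hashlib.sha1(concatenated_hashes).digest()


def split_block(block: bytes) -> list[bytes]:
    """Split a fixed-size block (e.g., 8 KB) into equal-size sub blocks."""
    return [block[i:i + SUB_BLOCK_SIZE] for i in range(0, len(block), SUB_BLOCK_SIZE)]


def hash_block(block: bytes) -> tuple[list[bytes], bytes]:
    """Return the sub block hashes H_1..H_N and the block hash H = F2(H_1 · H_2 · ... · H_N)."""
    sub_hashes = [f1(sb) for sb in split_block(block)]
    return sub_hashes, f2(b"".join(sub_hashes))
```

In this sketch, a router node would call hash_block on a write and send both the block and the sub block hashes to the data node for storage.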
In other embodiments, a method further comprises receiving, at the first node, an I/O request to read one or more sub blocks of the block; reading the requested sub blocks and corresponding sub block error detection hashes from the storage device at the second node; generating, at the second node, an expected error detection hash using the read sub block error detection hashes; sending the read sub blocks and the expected error detection hash from the second node to the first node; generating, at the first node, an actual error detection hash of the read sub blocks; and reporting data corruption if the actual error detection hash and the expected error detection hash do not match. In some embodiments, generating an expected error detection hash using the read sub block error detection hashes comprises: concatenating the read sub block error detection hashes together, and calculating a hash of the concatenated read sub block error detection hashes.
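Reusing the hypothetical f1/f2 helpers above, the small-read flow might be sketched as follows; only the requested sub blocks and a single expected hash cross the wire, and the data node derives that hash from the stored sub block hashes without rehashing the data itself.

```python
def expected_read_hash(stored_sub_hashes: list[bytes], indices: list[int]) -> bytes:
    """Data node: build the expected hash from the stored sub block hashes alone."""
    return f2(b"".join(stored_sub_hashes[i] for i in indices))


def verify_read(read_sub_blocks: list[bytes], expected: bytes) -> list[bytes]:
    """Router node: rehash the received sub blocks and compare against the expected hash."""
    actual = f2(b"".join(f1(sb) for sb in read_sub_blocks))
    if actual != expected:
        raise IOError("data corruption detected on small read")
    return read_sub_blocks
```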
In one embodiment, a method further comprises receiving, at the first node, an I/O request comprising one or more updated sub blocks; generating an updated sub block error detection hash for each of the updated sub blocks; sending the updated sub blocks and the updated sub block error detection hashes from the first node to the second node; reading one or more original sub blocks and corresponding sub block error detection hashes from the storage device; generating an updated block using the original sub blocks and the updated sub blocks; and writing the updated block, the original sub block error detection hashes, and the updated sub block error detection hashes to the storage device. In another embodiment, a method further comprises calculating an updated block error detection hash using the original sub block error detection hashes and the updated sub block error detection hashes, and adding the updated block error detection hash to metadata within the distributed storage system.
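A corresponding small-write sketch under the same assumptions: only the updated sub blocks are hashed, while untouched sub blocks contribute their previously stored hashes to the updated block hash.

```python
def apply_small_write(sub_blocks: list[bytes],
                      sub_hashes: list[bytes],
                      updates: dict[int, bytes]) -> tuple[bytes, list[bytes], bytes]:
    """Merge updated sub blocks into the block without rehashing unchanged data."""
    merged = list(sub_blocks)
    hashes = list(sub_hashes)
    for index, data in updates.items():
        merged[index] = data
        hashes[index] = f1(data)          # hash only the sub blocks that changed
    updated_block = b"".join(merged)
    updated_hash = f2(b"".join(hashes))   # block hash from original + updated sub block hashes
    return updated_block, hashes, updated_hash
```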
According to another aspect of the disclosure, a system comprises a processor, a volatile memory, and a non-volatile memory. The non-volatile memory may store computer program code that when executed on the processor causes the processor to execute a process operable to perform one or more embodiments of the method described hereinabove.
According to yet another aspect of the disclosure, a computer program product may be tangibly embodied in a non-transitory computer-readable medium, the computer-readable medium storing program instructions that are executable to perform one or more embodiments of the methods described hereinabove.
The foregoing features may be more fully understood from the following description of the drawings in which:
The drawings are not necessarily to scale, or inclusive of all elements of a system, emphasis instead generally being placed upon illustrating the concepts, structures, and techniques sought to be protected herein.
Before describing embodiments of the structures and techniques sought to be protected herein, some terms are explained. In certain embodiments, as may be used herein, the term “storage system” may be broadly construed so as to encompass, for example, private or public cloud computing systems for storing data as well as systems for storing data comprising virtual infrastructure and those not comprising virtual infrastructure. In some embodiments, as may be used herein, the terms “client,” “customer,” and “user” may refer to any person, system, or other entity that uses a storage system to read/write data.
In many embodiments, as may be used herein, the term “storage device” may refer to any non-volatile memory (NVM) device, including hard disk drives (HDDs), flash devices (e.g., NAND flash devices), and next generation NVM devices, any of which can be accessed locally and/or remotely (e.g., via a storage attached network (SAN)). In certain embodiments, the term “storage array” may be used herein to refer to any collection of storage devices. In some embodiments herein, for simplicity of explanation, the term “disk” may be used synonymously with “storage device.”
In many embodiments, as may be used herein, the term “random access storage device” may refer to any non-volatile random access memory (i.e., non-volatile memory wherein data can be read or written in generally the same amount of time irrespective of the physical location of data inside the memory). Non-limiting examples of random access storage devices may include NAND-based flash memory, single level cell (SLC) flash, multilevel cell (MLC) flash, and next generation non-volatile memory (NVM).
In certain embodiments, the term “I/O request” or simply “I/O” may be used to refer to an input or output request. In some embodiments, an I/O request may refer to a data read or write request.
In some embodiments, vendor-specific terminology may be used herein to facilitate understanding; however, it is understood that the concepts, techniques, and structures sought to be protected herein are not limited to use with any specific commercial products.
Referring to the embodiments of
The distributed storage system 100 stores data in the storage 106 in fixed-size blocks, for example 8 KB blocks. Each block has a fingerprint that uniquely identifies the data within that block. A block's fingerprint can be generated using a fingerprint function that takes the block's data as input. In various embodiments, the fingerprint function is a hash function, such as Secure Hash Algorithm 1 (SHA-1). In other embodiments, the fingerprint function may be a function that computes an error correcting code.
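For illustration, a SHA-1 fingerprint reduces to a single library call (SHA-1 is the example named above; any fingerprint function with the same interface would serve):

```python
import hashlib


def fingerprint(block: bytes) -> bytes:
    """Fingerprint function: identify a block's data by its SHA-1 digest."""
    return hashlib.sha1(block).digest()
```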
Referring again to
Each node 102 may be configured to receive I/O requests from clients 108. I/O requests may include “full” reads/writes and “small” reads/writes. A full read/write is a request to read/write a full block of data (e.g., 8 KB of data), whereas a small read/write is a request to read/write less than a full block of data (e.g., less than 8 KB of data). The node 102 that initially receives a client I/O request may be different from the node that is responsible for actually reading/writing the request data. In this situation, the node that initially receives the request (referred to herein as the “router node”) delegates the I/O request to the reading/writing node (referred to herein as the “data node”). Each node 102 may function as either a router node or a data node, depending on the data flow for a particular I/O request. In some embodiments, a node of the distributed storage system may be the same as or similar to the embodiment of
Some embodiments of the present disclosure can provide end-to-end data protection by detecting (and in some cases correcting) data corruption that may occur within a storage device and/or on a communication line between two nodes.
The data node 102b writes the data and error detection hash together to its attached storage device 106b. In some embodiments, the storage system 100 may perform additional processing, such as updating metadata to track the storage location of the unique block. In such embodiments, after the data and error detection hash are successfully written to the storage device 106, acknowledgements (ACK) may be sent from the data node 102b to the router node 102a, and from the router node 102a to the client 108.
It will be appreciated that the embodiments of
In the embodiment of
Referring again to
Referring back to
Referring back to
In some embodiments, in addition to storing data, the storage array 206 may store various types of metadata used by the storage system node 200. In such embodiments, metadata may include information within the A2H and H2P tables 212, 214. In particular embodiments, metadata may be used by the various subsystems 202 during the course of processing I/O operations.
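Purely as an illustration (the disclosure does not specify the table layouts), the two tables can be modeled as maps, assuming the A2H table translates a logical address to a block hash and the H2P table translates that hash to a physical location:

```python
a2h: dict[int, bytes] = {}  # logical block address -> block hash (A2H table)
h2p: dict[bytes, int] = {}  # block hash -> physical storage location (H2P table)


def locate(address: int) -> int:
    """Resolve a logical address to a physical location via the two metadata tables."""
    return h2p[a2h[address]]
```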
Referring to
H_i = F1(b_i)
The sub block error detection hashes 304 may be concatenated together and the concatenation may be passed as input to a second code generation function F2 305 to generate a “full” block error detection hash (H) 306:
H = F2(F1(b_1) · F1(b_2) · … · F1(b_N)) = F2(H_1 · H_2 · … · H_N).
In various embodiments, during certain I/O operations, the full block error detection hash 306 may be re-generated using only the sub block error detection hashes 304 (e.g., to reduce processing). In certain embodiments, the sub block error detection hashes H_1, H_2, etc. may be generated in parallel.
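Because each H_i depends only on its own sub block, the hashes can be computed concurrently. A sketch using a thread pool (hashlib can release the GIL when hashing sufficiently large buffers, so threads may suffice; a process pool is an alternative for small buffers):

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_sub_hashes(sub_blocks: list[bytes]) -> list[bytes]:
    """Compute H_1..H_N concurrently; each F1(b_i) is independent of the others."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(f1, sub_blocks))
```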
Referring again to
Referring to
Referring to
Referring back to
In the embodiment of
Referring again to
In certain embodiments, the second node generates an error detection hash for the full block using the sub block error detection hashes: H = F2(H_1 · H_2 · … · H_N), where F2 is a second code generation function as described above in conjunction with
Referring again to
Referring again to
Referring again to
Referring back to
Referring to
Referring again to
Referring again to
Referring again to
Referring again to
In various embodiments, the method 440 reduces computation and bandwidth used to process small reads as compared with existing systems, while providing end-to-end data protection.
Referring to
Referring again to
Referring again to
Referring again to
Referring again to
In certain embodiments, the second node may generate an error detection hash of the updated block using the sub block error detection hashes. In one embodiment, assuming the first sub block was updated, the updated block error detection hash may be generated as H′ = F2(H′_1 · H_2 · … · H_N). In some embodiments, the updated block error detection hash may be used as the updated block's unique error detection hash within the distributed storage system. In certain embodiments, the updated block error detection hash may be stored within metadata (e.g., within the A2H table and/or H2P table).
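Continuing the earlier sketch, recomputing the block hash after a single sub block update replaces one stored hash and rehashes only the concatenation:

```python
def updated_block_hash(sub_hashes: list[bytes], index: int, new_sub_block: bytes) -> bytes:
    """Recompute H' = F2(H'_1 · H_2 · ... · H_N) after one sub block changes."""
    hashes = list(sub_hashes)
    hashes[index] = f1(new_sub_block)  # only the updated sub block is rehashed
    return f2(b"".join(hashes))
```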
In various embodiments, the method 460 may reduce computation and bandwidth used to process small writes as compared with existing systems, while providing end-to-end data protection.
In the embodiment of
Processing may be implemented in hardware, software, or a combination of the two. In various embodiments, processing is provided by computer programs executing on programmable computers/machines that each includes a processor, a storage medium or other article of manufacture that is readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and one or more output devices. Program code may be applied to data entered using an input device to perform processing and to generate output information.
The system can perform processing, at least in part, via a computer program product (e.g., in a machine-readable storage device) for execution by, or to control the operation of, data processing apparatus (e.g., a programmable processor, a computer, or multiple computers). Each such program may be implemented in a high-level procedural or object-oriented programming language to communicate with a computer system. However, the programs may be implemented in assembly or machine language. The language may be a compiled or an interpreted language and it may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network. A computer program may be stored on a storage medium or device (e.g., CD-ROM, hard disk, or magnetic diskette) that is readable by a general or special purpose programmable computer for configuring and operating the computer when the storage medium or device is read by the computer. Processing may also be implemented as a machine-readable storage medium, configured with a computer program, where upon execution, instructions in the computer program cause the computer to operate.
Processing may be performed by one or more programmable processors executing one or more computer programs to perform the functions of the system. All or part of the system may be implemented as special purpose logic circuitry (e.g., an FPGA (field programmable gate array) and/or an ASIC (application-specific integrated circuit)).
All references cited herein are hereby incorporated herein by reference in their entirety.
Having described certain embodiments, which serve to illustrate various concepts, structures, and techniques sought to be protected herein, it will be apparent to those of ordinary skill in the art that other embodiments incorporating these concepts, structures, and techniques may be used. Elements of different embodiments described hereinabove may be combined to form other embodiments not specifically set forth above and, further, elements described in the context of a single embodiment may be provided separately or in any suitable sub-combination. Accordingly, it is submitted that the scope of protection sought herein should not be limited to the described embodiments but rather should be limited only by the spirit and scope of the following claims.