The field of the invention relates generally to computer storage systems. In particular, the present method and system is directed to a multipurpose storage system based upon a distributed hashing mechanism with transactional support and failover capacity.
As storage needs increase, solutions have to be found to drive the cost of storage down and maintain ease of management. The use of a Chord based network (a peer-to-peer technology) partially solves certain problems. The use of self organizing finger tables solves the problem of scaling by avoiding the need for centralized information. The use of intelligent routing limits the number of requests needed to reach a node. The use of consistent hashing also limits the impact of modifying the network topology (when adding or removing nodes, or when nodes fail).
The use of a Chord network ensures overall consistency of routing (with some limitations) and self-organizing stabilization, but it does not provide a real way to replicate information or to rebalance content in case of topology change.
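By way of illustration only, the successor rule that underlies consistent hashing on a Chord ring can be sketched as follows. The sketch assumes a global, sorted view of the node IDs, which a real Chord network does not have (it routes through finger tables instead); the function name `find_successor` mirrors the classical Chord operation, and the node IDs 800, 900 and 1000 are example values.

```python
import bisect


def find_successor(node_ids, key):
    """Return the first node ID >= key on the ring, wrapping past the top.

    Simplified global-view lookup (an assumption of this sketch); Chord
    itself reaches the same answer by routing through finger tables.
    """
    ring = sorted(node_ids)
    idx = bisect.bisect_left(ring, key)
    return ring[idx % len(ring)]


# Adding node 900 only changes the owner of keys in the range (800, 900].
before = [800, 1000]
after = [800, 900, 1000]
moved = find_successor(before, 840) != find_successor(after, 840)
unchanged = find_successor(before, 950) == find_successor(after, 950)
```

Only the keys that fall between the new node and its predecessor change owner, which is the limited topology impact referred to above.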
Best practice solutions move the complexity of managing storage into dedicated storage systems to save application servers from embedding storage disks directly, avoiding many inconveniences such as disk failure management, data loss, data reconstruction, and enabling economies of scale by better managing a shared pool of storage resources. Typical technologies include:
Object stores that do not follow the centralized architecture design can be deployed on a large cluster of generic servers, pushing fault tolerance on the software and the network stack rather than dedicated storage hardware.
Because SAN technology is block based as opposed to file based and slices storage capacity into monolithic volumes, solutions derived from this technology cannot perform storage optimization based on files or objects and have to manipulate small anonymous binary blobs called blocks with very little metadata attached to them. Recent improvements such as thin provisioning (i.e., over-allocation of storage space for each volume to minimize the need for growing existing volumes) are natural evolutions.
Object stores are re-emerging and put more emphasis on metadata and file awareness to push more intelligence into the storage solution including file access patterns and domain specific metadata that can be utilized to implement per file classes of storage. For example, an email platform using an object store instead of a volume based approach could add metadata declaring a message as legitimate, undesired or high priority. The object store could use the metadata to change classes of storage appropriately. For example, the system may maintain only one copy of illegitimate messages or keep high priority messages in cache for faster access.
A multipurpose storage system based upon a distributed hashing mechanism with transactional support and failover capability is disclosed. According to one embodiment, a system comprises a client system in communication with a network, a secondary storage system in communication with the network, and a supervisor system in communication with the network. The supervisor system assigns a unique identifier to a first node system and places the first node system in communication with the network in a location computed by using hashing. The client system stores a data object on the first node system.
The above and other preferred features, including various novel details of implementation and combination of elements, will now be more particularly described with reference to the accompanying drawings and pointed out in the claims. It will be understood that the particular methods and circuits described herein are shown by way of illustration only and not as limitations. As will be understood by those skilled in the art, the principles and features described herein may be employed in various and numerous embodiments without departing from the scope of the invention.
The accompanying drawings, which are included as part of the present specification, illustrate the presently preferred embodiment and together with the general description given above and the detailed description of the preferred embodiment given below serve to explain and teach the principles of the present invention.
FIGS. 7a and 7b illustrate exemplary transaction validation tables within a multipurpose storage system, according to one embodiment.
a illustrates an exemplary put operation within a multipurpose storage system, according to one embodiment.
b illustrates an exemplary put operation within a multipurpose storage system, according to one embodiment.
a is an exemplary list of commands within a multipurpose storage system, according to one embodiment.
b is an exemplary list of transaction types within a multipurpose storage system, according to one embodiment.
c is an exemplary list of return codes within a multipurpose storage system, according to one embodiment.
FIGS. 20a and 20b illustrate exemplary aging and packing mechanisms within a multipurpose storage system, according to one embodiment.
It should be noted that the figures are not necessarily drawn to scale and that elements of similar structures or functions are generally represented by like reference numerals for illustrative purposes throughout the figures. It also should be noted that the figures are only intended to facilitate the description of the various embodiments described herein. The figures do not describe every aspect of the teachings described herein and do not limit the scope of the claims.
A multipurpose storage system based upon a distributed hashing mechanism with transactional support and failover capability is disclosed. According to one embodiment, a system comprises a client system in communication with a network, a secondary storage system in communication with the network, and a supervisor system in communication with the network. The supervisor system assigns a unique identifier to a first node system and places the first node system in communication with the network in a location computed by using hashing. The client system stores a data object on the first node system.
Chord based technology does not provide a way to manage fault tolerance and availability of content in cases of node joins, leaves or failures. This limitation is overcome by using a clear assignment system for node ID's, chunk ID's and replica ID's and by using a transactional system that: (1) guarantees to store a chunk and all of its replicas, (2) guarantees to retrieve a chunk or one of its replicas in case of a node failure and (3) guarantees to delete a chunk and all of its replicas.
Chord based technology does not provide a way to automatically circumvent or repair a ring that is temporarily missing nodes (and content), that has misplaced information (in case of node reappearance) or definitive content failure. This limitation is overcome by using supervising computers that trigger automatic rebalancing (MOVE) and also detect chunk range overlaps. The supervising computers send information to node computers that improve availability of content.
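By way of illustration only, overlap detection by a supervising computer might be sketched as follows. The sketch assumes that each node reports the lowest and highest chunk ID it hosts, and that a node is expected to host only chunk ID's in the interval between its predecessor's ID (exclusive) and its own ID (inclusive). The names `in_interval` and `find_overlaps`, the report format and the node IDs are hypothetical.

```python
def in_interval(key, pred, node, m):
    """True if key lies in (pred, node] on a ring of size 2**m."""
    size = 1 << m
    span = (node - pred) % size
    if span == 0:  # single node on the ring: it owns the whole key space
        return True
    dist = (key - pred) % size
    return 0 < dist <= span


def find_overlaps(reports, m):
    """Return the node IDs whose reported chunk range leaves their range.

    reports maps node_id -> (lowest_chunk_id, highest_chunk_id).
    Only the two endpoints are checked, which is enough for a sketch.
    """
    ring = sorted(reports)
    overlapping = []
    for i, node in enumerate(ring):
        pred = ring[i - 1]  # index -1 wraps to the last node on the ring
        low, high = reports[node]
        if not (in_interval(low, pred, node, m)
                and in_interval(high, pred, node, m)):
            overlapping.append(node)
    return overlapping
```

With reports `{800: (750, 800), 900: (810, 900), 1000: (840, 1000)}` and `m = 16`, node 1000 is flagged because chunk 840 belongs to the range of node 900.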
The present embodiments provide a scalable storage system with no central point. The present system has the advantages of using low expenditure devices (e.g. cheap micro computers using cheap SATA disks) to build low cost, robust and scalable storage systems.
The present system and method uses a Chord network as a key/value store. Included in the present system are a replication system, and transactional support. Ensured are automatic redundancy, persistence and availability of content, and aging and packing of content before sending it to an archival system. The present embodiments further concern a corresponding computer software product, a key/value storage device, a message store device and a dynamic content caching device.
In the following description, for purposes of explanation, specific nomenclature is set forth to provide a thorough understanding of the various inventive concepts disclosed herein. However, it will be apparent to one skilled in the art that these specific details are not required in order to practice the various inventive concepts disclosed herein.
Some portions of the detailed descriptions that follow are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. A method is here, and generally, conceived to be a self-consistent process leading to a desired result. The process involves physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussion, it is appreciated that throughout the description, discussions utilizing terms such as “processing” or “computing” or “calculating” or “determining” or “displaying” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
The present method and system also relates to apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magnetic-optical disks, read-only memories (“ROMs”), random access memories (“RAMs”), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions, and each coupled to a computer system bus.
The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform the required method steps. The required structure for a variety of these systems will appear from the description below. In addition, the present invention is not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the method and system as described herein.
A data storage device 127 such as a magnetic disk or optical disc and its corresponding drive may also be coupled to computer system 100 for storing information and instructions. Architecture 100 can also be coupled to a second I/O bus 150 via an I/O interface 130. A plurality of I/O devices may be coupled to I/O bus 150, including a display device 143, an input device (e.g., an alphanumeric input device 142 and/or a cursor control device 141).
The communication device 140 allows for access to other computers (servers or clients) via a network. The communication device 140 may comprise one or more modems, network interface cards, wireless network interfaces or other well known interface devices, such as those used for coupling to Ethernet, token ring, or other types of networks.
Because of the nature of the Chord algorithm, communication occurs point to point from potentially many servers to many different servers (many to many) with no central communication point. As a result, the global performance of the present system does not depend on the number of these components.
The present system does not require the use of supervisor computers 204 during normal operation. Normal operation is characterized by storing chunks, retrieving chunks, deleting chunks, handling a given number of failures, among other operations. Thus the node computers 202 do not view supervisor computers 204 until the latter have connected to them. Supervisor computers 204 are used for inserting node computers 202, deleting node computers 202, improving overall synchronization of the different components, and offering a system administration view of the present system.
Initiator computers 201 are the interface to the outside world and they are clients to the present system. They support access to the system with the following protocols: key/value store interface by using a custom transactional “Chord client API”, or FUSE.
Secondary storage systems 203 can be other Chord rings (similar to 205), SAN's, or dispersed storage installations.
According to one embodiment, the present system is deployed as a standalone storage solution (without the usage of a secondary storage), or as a storage cache when secondary storage is present. In the latter case an aging and packing mechanism is used as described below in conjunction with
Node 800 responds with its ID, state, and range 304. Node 900 responds with its ID, state, and range 305. Node 1000 responds with its ID, state, and range 306. The range is [IDlow,IDhigh], or the lowest chunk ID to the highest chunk ID hosted by a node computer. Based upon the received responses the supervisor computer can detect overlaps, and in this example one is detected between Node 900 and Node 1000 307. An overlap occurs when a node hosts some chunk that it should not host according to its ID information. The supposed range is the range between its predecessor and its own ID (Chord). However, after an operation (for example a join described in
$$ID_0 = 0, \qquad ID_{n+1} = \left(ID_x + \big((ID_y = 0)\ ?\ [2^m - ID_x] : [ID_y - ID_x]\big) \times \varphi\right) \bmod 2^m$$

where $ID_x$ and $ID_y$ delimit the widest range between two existing nodes on the ring, with $ID_y > ID_x$ on the directed ring, $\varphi$ being a real number, and $m$ the number of bits in the Chord key space $2^m$.
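By way of illustration only, the node ID assignment above can be sketched as follows. The sketch assumes that the final reduction in the formula is a modulo over the key space, that the product is truncated to an integer, and that the ratio defaults to one half; the function name `next_node_id` and these defaults are hypothetical.

```python
from fractions import Fraction


def next_node_id(existing_ids, m, phi=Fraction(1, 2)):
    """Pick the next node ID inside the widest gap between adjacent nodes.

    The first node receives ID 0. For later nodes, the widest gap between
    a node x and its successor y is found, and the new ID is placed at a
    fraction phi of that gap, modulo 2**m.
    """
    size = 1 << m
    if not existing_ids:
        return 0
    ring = sorted(set(existing_ids))
    best_x, best_gap = ring[0], -1
    for i, x in enumerate(ring):
        y = ring[(i + 1) % len(ring)]
        # Equals 2**m - x when y == 0 (wrap-around) and y - x otherwise.
        gap = (y - x) % size or size
        if gap > best_gap:
            best_x, best_gap = x, gap
    return (best_x + int(best_gap * Fraction(phi))) % size


# On a small 8-bit key space the first four nodes spread evenly.
ids = []
for _ in range(4):
    ids.append(next_node_id(ids, m=8))
```

With a ratio of one half, each new node bisects the widest remaining gap, so the first four IDs on an 8-bit ring are 0, 128, 64 and 192.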
For a given distributed chunk ID's (see
Entropy information 601 can be a random number (assigned by an entropy device or by a hash function) or a given number. The number should be equiprobable or approaching equiprobability on the range 0 to $2^{m-p}$, where $m$ is the number of bits of the key space and $p = 8$ is the number of bits for coding class and replica information.
Class information 602 is coded on 4 bits in the chunk ID, according to this example, and defines the number of replicas or a custom replica policy. The main replica is not counted in the number of replicas. The custom replica policies are defined outside the chunk ID and are fetched before being applied.
Replica information 603 is coded on 4 bits. Replica number 0 is the main replica. Other replica ID's (606, 607) are computed using an “angle” 604 that is equivalent to 2π divided by the total number of replicas. The formula does not depend upon π but on modulo arithmetic:
The formula gives the jth replica from any replica of $ID_n$, given R the number of replicas, $m$ the number of bits in the Chord key space $2^m$, and $p$ the number of bits coding class and replica information.
As a property of the formula, when a replica ID is known, the chunk ID of any of the other replicas may be calculated.
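By way of illustration only, one possible chunk ID layout and replica computation is sketched below. Several points are assumptions of the sketch and are not taken from the present description: the entropy occupies the high bits, followed by 4 class bits and 4 replica bits; the "angle" is taken as the entropy space divided by the total number of copies (main replica included); and the helper names `make_chunk_id`, `split_chunk_id` and `replica_id` are hypothetical. The sketch only demonstrates the stated property that any replica ID can be derived from any other.

```python
CLASS_BITS = 4
REPLICA_BITS = 4
P = CLASS_BITS + REPLICA_BITS  # p = 8 bits for class and replica info


def make_chunk_id(entropy, cls, replica, m):
    """Assemble entropy | class | replica into one m-bit chunk ID."""
    if not 0 <= entropy < (1 << (m - P)):
        raise ValueError("entropy out of range")
    if not (0 <= cls < 16 and 0 <= replica < 16):
        raise ValueError("class and replica must fit in 4 bits")
    return (entropy << P) | (cls << REPLICA_BITS) | replica


def split_chunk_id(chunk_id):
    """Return (entropy, class, replica number) of a chunk ID."""
    return chunk_id >> P, (chunk_id >> REPLICA_BITS) & 0xF, chunk_id & 0xF


def replica_id(known_id, j, total, m):
    """Derive the ID of replica j from any known replica of the chunk.

    total is the total number of copies, main replica included
    (an assumption of this sketch).
    """
    entropy, cls, i = split_chunk_id(known_id)
    space = 1 << (m - P)
    step = space // total  # the "angle" expressed in key space units
    new_entropy = (entropy + (j - i) * step) % space
    return make_chunk_id(new_entropy, cls, j, m)
```

Because the same step is added or subtracted for each replica index, moving from replica i to replica j and back returns the original chunk ID exactly.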
When there is no particular overlap, chunk retrieval is possible because initiator computers find the successor of a node with a classical Chord find_successor( ) operation and ask for the chunk by using GET or GET_LOCAL commands (see command list in
A specific API called “Chord client API” offers a robust key/value store API to initiator computers that stores, retrieves and deletes data chunks into and from the ring safely by using a transaction concept. All of the described actions have a reservation operation that maintains the integrity of the data: 1) chunk ID preparation on the initiator computer, 2) chunk ID checking and update of transaction table on the node computer, 3) transaction management, 4) command acceptance or rejection, and 5) result/return management.
Chunk ID preparation is done by the caller of the API on the initiator computer. A unique chunk ID is provided that is reserved along with all its replicas. A random chunk ID may be generated or a unique chunk ID may be picked elsewhere. The chunk IDs of all the replicas are also computed by using the formula described in
Chunk ID checking and update of the transaction table is executed on the node computers that receive the RESERVE command. It consists of two actions: checking and updating a transaction table, and verifying whether or not the requested chunk ID (main chunk ID or replica) already exists on the node.
Transaction management includes sending all the RESERVE requests at once to one or many replicas. This mechanism guarantees that the chunk ID and all its replicas are available for the following GET, PUT or DELETE command. By determining in advance how many replicas are available, a number is determined below which the action is not executed. For example, if the availability of four replicas is checked and only three are found, storing three out of four may be deemed acceptable, and processing continues. Conversely, if only two replicas out of four are found, the PUT command may not be executed; the operation is then retried with different chunk IDs or canceled.
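By way of illustration only, the threshold decision described above can be sketched as follows. The sketch assumes that the outcome of each RESERVE request is already known as a boolean per replica chunk ID; the names `Decision` and `decide_after_reserve` and the result format are hypothetical.

```python
from enum import Enum


class Decision(Enum):
    PROCEED = "proceed"
    RETRY_OR_CANCEL = "retry_or_cancel"


def decide_after_reserve(reserve_results, min_required):
    """Decide whether to continue after the RESERVE requests.

    reserve_results maps each replica chunk ID to True when its RESERVE
    succeeded. The action goes ahead only if at least min_required
    replicas were reserved.
    """
    reserved = sorted(cid for cid, ok in reserve_results.items() if ok)
    if len(reserved) >= min_required:
        return Decision.PROCEED, reserved
    return Decision.RETRY_OR_CANCEL, reserved


# Three of four replicas reserved, with a threshold of three: proceed.
decision, reserved = decide_after_reserve(
    {0x1220: True, 0x5221: True, 0x9222: True, 0xD223: False},
    min_required=3,
)
```

When the decision is to retry or cancel, the caller would either pick different chunk IDs and reserve again or abandon the operation, as described above.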
Command acceptance or rejection is performed at the level of the node computer to guarantee data integrity. Situations exist where a current transaction will forbid some commands that may have been emitted in the meantime (when sending a RESERVE, the transaction should be kept (e.g. an ‘X’ in
Result and return management monitors the status sent to a caller after an attempt to retrieve, store or delete a chunk.
FIGS. 7a and 7b illustrate exemplary transaction validation tables within a multipurpose storage system, according to one embodiment. The table in
The table in
Since there may be many overlaps, there may be many proxies set for a node (client and server proxies). When operating on a chunk a node computer tries all the matching proxy ranges.
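By way of illustration only, trying all matching proxy ranges when retrieving a chunk might be sketched as follows. The sketch assumes that each proxy entry is a non-wrapping chunk ID range associated with a remote node, and that a callable standing in for a GET_LOCAL request to that node is supplied by the caller; the names `Proxy`, `matching_proxies`, `get_with_proxies` and `fetch_remote` are hypothetical.

```python
from typing import NamedTuple


class Proxy(NamedTuple):
    low: int          # lowest chunk ID covered by the proxy range
    high: int         # highest chunk ID covered by the proxy range
    remote_node: int  # node to ask for chunks in that range


def matching_proxies(proxies, chunk_id):
    """Return every proxy whose range contains chunk_id."""
    return [p for p in proxies if p.low <= chunk_id <= p.high]


def get_with_proxies(local_store, proxies, chunk_id, fetch_remote):
    """Look for a chunk locally, then through each matching proxy.

    fetch_remote(node_id, chunk_id) stands in for a GET_LOCAL sent to a
    remote node and returns the chunk data, or None when it is absent.
    """
    if chunk_id in local_store:
        return local_store[chunk_id]
    for proxy in matching_proxies(proxies, chunk_id):
        data = fetch_remote(proxy.remote_node, chunk_id)
        if data is not None:
            return data
    return None
```

The local store is consulted first so that a chunk which has not yet moved is served without any remote request.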
If initiator computer 1120 had the wrong successor information then the RESERVE command targeted node 1000. In such a case, the behavior is the same as just described with regard to
Note this RESERVE operation may be done before any of the GET, PUT and DELETE commands as described below.
According to one embodiment, 840 is already hosted by node 900. In such a case there would have been no need to search node 1000.
According to one embodiment, initiator computers still believe that the successor of 840 is node 1000. In such a case the chunk might still be on node 1000, in which case it is fetched from there, or it might already be on node 900. Node 1000 checks its proxy information 1201 and issues a GET_LOCAL on node 900.
a illustrates an exemplary PUT operation on a proxy client within a multipurpose storage system, according to one embodiment. Proxy information is set on both nodes 1301, 1302. An initiator computer 1320 requests that a chunk be written 1303. Note the RESERVE command has previously checked and reserved the chunks for writing. The node 900 targets the right successor for the chunk ID and no special action is required for the remote node, and the chunk is locally stored 1304. A success indicator is returned 1305 to the initiator 1320.
b illustrates an exemplary PUT operation on a proxy server within a multipurpose storage system, according to one embodiment. Proxy information is set on both nodes 1301, 1302. An initiator computer 1320 requests that a chunk be written 1306 but has the wrong successor information. The chunk is stored on the disk of the wrong successor 1307, but will be accessible through the GET proxy command as in
The purge operation is launched from time to time (controlled by a supervisor computer) to physically erase chunks that have been marked as deleted for longer than a given time. The command is sent to all the nodes.
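By way of illustration only, the purge operation on a single node might be sketched as follows. The sketch assumes that each chunk carries a deletion timestamp in its metadata (absent or `None` when the chunk is not marked as deleted); the function name `purge` and the metadata key `deleted_at` are hypothetical.

```python
import time


def purge(chunks, max_age_seconds, now=None):
    """Physically erase chunks marked as deleted for too long.

    chunks maps chunk_id -> metadata dict; metadata["deleted_at"] holds
    the time at which the chunk was marked as deleted, or None.
    Returns the sorted list of erased chunk IDs.
    """
    now = time.time() if now is None else now
    expired = [
        cid for cid, meta in chunks.items()
        if meta.get("deleted_at") is not None
        and now - meta["deleted_at"] > max_age_seconds
    ]
    for cid in expired:
        del chunks[cid]
    return sorted(expired)
```

Chunks that are not marked as deleted, or that were marked too recently, are left untouched.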
a illustrates an exemplary list of commands 1801 and their descriptions 1802 used for failover and transactional support extensions within a multipurpose storage system, according to one embodiment. Commands 1801 include, yet are not limited to, the following.
b illustrates an exemplary list of transaction types 1803 and their descriptions 1804 within a multipurpose storage system, according to one embodiment. Transaction types 1803 include, yet are not limited to, the following.
c illustrates an exemplary list of return codes 1805 and their descriptions 1806 within a multipurpose storage system, according to one embodiment. Return codes 1805 include, yet are not limited to, the following.
a illustrates an exemplary aging and packing mechanism within a multipurpose storage system, according to one embodiment. If activated, the packing process is automatically started on node computers 2021. Chunks are found with a modification time older than a specified timeout 2001. Chunks are marked as archived in metadata and contents are deleted 2002. A collection of chunks is packed into a bigger block 2003, and sent 2004 to a secondary storage system 2022.
b illustrates an exemplary aging and packing mechanism within a multipurpose storage system, according to one embodiment. An initiator computer 2020 fetches a chunk 2005 that is archived 2006 (e.g. content is no longer present on the disk but the metadata is still present). Metadata contains information to fetch the block storing the chunk 2007 on secondary storage 2022 and 2008. A block is unpacked 2009 then sent back 2010 to the initiator 2020.
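By way of illustration only, the aging and packing mechanism and the retrieval of an archived chunk might be sketched as follows. The sketch models the node store and the secondary storage system as in-memory dictionaries, and it writes the packed block to secondary storage before dropping local contents, which is a design choice of the sketch rather than the order of the steps above. The names `pack_old_chunks` and `fetch_chunk` and the metadata keys are hypothetical.

```python
def pack_old_chunks(node_store, secondary, block_id, timeout, now):
    """Pack chunks older than timeout into one block on secondary storage.

    node_store maps chunk_id -> dict with keys "data", "mtime",
    "archived" and "block_id". Returns the sorted packed chunk IDs.
    """
    old = {
        cid: c for cid, c in node_store.items()
        if not c["archived"] and now - c["mtime"] > timeout
    }
    if not old:
        return []
    # Send the packed block first, then drop the local contents.
    secondary[block_id] = {cid: c["data"] for cid, c in old.items()}
    for c in old.values():
        c["archived"] = True
        c["block_id"] = block_id
        c["data"] = None  # metadata stays, content is removed
    return sorted(old)


def fetch_chunk(node_store, secondary, chunk_id):
    """Return chunk data, unpacking it from its block when archived."""
    c = node_store[chunk_id]
    if not c["archived"]:
        return c["data"]
    block = secondary[c["block_id"]]  # located through the metadata
    return block[chunk_id]
```

The metadata kept on the node is what allows an archived chunk to be located again on the secondary storage system.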
A multipurpose storage system based upon a distributed hashing mechanism with transactional support and failover capability has been disclosed. It is understood that the embodiments described herein are for the purpose of elucidation and should not be considered as limiting the subject matter of the disclosure. Various modifications, uses, substitutions, combinations, improvements, and methods of production that do not depart from the scope or spirit of the present invention would be evident to a person skilled in the art.
The present application claims the benefit of and priority to U.S. Provisional Patent Application No. 61/138,759 entitled “MULTIPURPOSE STORAGE SYSTEM BASED UPON A DISTRIBUTED HASHING MECHANISM WITH TRANSACTIONAL SUPPORT AND FAILOVER CAPABILITY” filed on Dec. 18, 2008, which is hereby incorporated by reference.
Number | Date | Country | |
---|---|---|---|
20100162035 A1 | Jun 2010 | US |
Number | Date | Country | |
---|---|---|---|
61138759 | Dec 2008 | US |