A portion of the disclosure of this patent document contains material which is subject to copyright protection. This patent document may show and/or describe matter which is or may become trade dress of the owner. The copyright and trade dress owner has no objection to the facsimile reproduction by anyone of the patent disclosure as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright and trade dress rights whatsoever.
Field
This disclosure relates to data stored in a data storage system and a method for storing data in a data storage system that allows for replication when a certain node or nodes are offline or unavailable to the core system.
Description of the Related Art
A file system is used to store and organize computer data stored as electronic files. File systems allow files to be found, read, deleted, and otherwise accessed. File systems store files on one or more storage devices, using storage media such as hard disk drives, magnetic tape and solid-state storage devices.
Various applications may store large numbers of documents, images, audio, videos and other data as objects using a distributed data storage system in which data is replicated and stored in multiple locations for resiliency.
The systems and methods described herein provide for a replicated data storage system that accommodates nodes that are unavailable or inaccessible for certain periods of time. In practice this system is useful when vessels, vehicles or aircraft are out of range, are not in port or are otherwise unable to be continuously connected to a network for operational, research or military considerations. For example, a ship at sea, a submarine exploring the floor of the ocean, aircraft flying at high altitude, and movable command centers involved with research, surveillance and/or command and control activities may all contain storage zones that are regularly inaccessible to a core network and connect and reconnect to the core network at intervals.
Environment
In the example shown in
The storage clusters 110 and 120 may be separated geographically, may be in separate states, may be in separate countries, may be in separate cities, may be in the same campus or base, may be in different campuses or bases, may be in separate buildings on a shared site, may be on separate floors of the same building, and arranged in other configurations. The stationary zones may be separated in the same location, may be in separate racks, may be in separate buildings on a shared site, may be on separate floors of the same building, and arranged in other configurations. Movable zone 160 may regularly or occasionally be near other storage zones that are part of storage clusters and may regularly or occasionally connect, disconnect and reconnect to the data storage system 100 via one of the storage clusters 110 and 120. The discontinuous nature of the connection of movable zone 160 is shown by the discontinuous lines between the movable zone 160 and stationary zone 112 of cluster 110 and stationary zone 122 of cluster 120. The regular or occasional disconnection and reconnection of a movable zone makes the network of the data storage system a disjointed network such that the data storage system is a disjointed data storage system.
The storage clusters, stationary zones and movable zones communicate with each other and share objects over wide area network 130. The wide area network 130 may be or include the Internet. The wide area network 130 may be wired, wireless, or a combination of these. The wide area network 130 may be public or private, may be a segregated network, and may be a combination of these. The wide area network 130 may include enhanced security features and may not be connected to the Internet. The wide area network 130 includes networking devices such as routers, firewalls, hubs, gateways, switches and the like.
The data storage system 100 may include a server 170 coupled with wide area network 130. The server 170 may augment or enhance the capabilities and functionality of the data storage system by promulgating policies, receiving and distributing search requests, compiling and/or reporting search results, and tuning and maintaining the data storage system. The server 170 may include and maintain an object database on a local storage device included in or coupled with the server 170. The object database may be indexed according to the object identifiers (OIDs) of the objects stored in the data storage system. In various embodiments, the object database may store only a small amount of information for each object or a larger amount of information. Pertinent to this patent is that the object database stores policy information for objects. In one embodiment, the object database is an SQLITE® database. In other embodiments, the object database may be a MONGODB®, Voldemort, or other key-value store. The objects and the object database may be referenced by object identifiers (OIDs) like those shown and described below regarding
The term data as used herein includes a bit, byte, word, block, stripe or other unit of information. In one embodiment, data is stored within and by the distributed replicated data storage system as objects. A data item may be stored as one object or multiple objects. That is, an object may be a data item or a portion of a data item. As used herein, the term data item is inclusive of entire computer readable files or portions of a computer readable file. The computer readable file may include or represent text, numbers, data, images, photographs, graphics, audio, video, raw data, scientific data, computer programs, computer source code, computer object code, executable computer code, and/or a combination of these and similar information.
Many data intensive applications store a large quantity of data. These applications include scientific applications, newspaper and magazine websites (for example, nytimes.com), scientific lab data capturing and analysis programs, video and film creation software, and consumer web based applications such as social networking websites (for example, FACEBOOK®), photo sharing websites (for example, FLICKR), geo-location based and other information services such as NOW from Google Inc. and SIRI® from Apple Inc., video sharing websites (for example, YOUTUBE®) and music distribution websites (for example, ITUNES®).
The storage zones, namely stationary zones 112, 114, 116, 122, 124 and 126 and movable zone 160, include a computing device and/or a controller on which software may execute. The computing device and/or controller may include one or more of logic arrays, memories, analog circuits, digital circuits, software, firmware, and processors such as microprocessors, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), programmable logic devices (PLDs) and programmable logic arrays (PLAs). The hardware and firmware components of the computing device and/or controller may include various specialized units, circuits, software and interfaces for providing the functionality and features described herein. The processes, functionality and features described herein may be embodied in whole or in part in software which operates on a controller and/or one or more computing devices in a storage zone, and may be in the form of one or more of firmware, an application program, object code, machine code, an executable file, an applet, a COM object, a dynamic linked library (DLL), a dynamically loaded library (.so), a script, one or more subroutines, or an operating system component or service, and other forms of software. The hardware and software and their functions may be distributed such that some actions are performed by a controller or computing device, and others by other controllers or computing devices within a storage zone.
A computing device as used herein refers to any device with a processor, memory and a storage device that may execute instructions such as software including, but not limited to, server computers, personal computers, portable computers, laptop computers, smart phones and tablet computers. Server 170 is, depending on the implementation, a specialized or general purpose computing device. The computing devices may run an operating system, including, for example, versions of the Linux, Unix, MICROSOFT® Windows, Solaris, Symbian, Android, Chrome, and APPLE® Mac OS X operating systems. Computing devices may include a network interface in the form of a card, chip or chip set that allows for communication over a wired and/or wireless network. The network interface may allow for communications according to various protocols and standards, including, for example, versions of Ethernet, INFINIBAND® network, Fibre Channel, and others. A computing device with a network interface is considered network capable.
Referring again to
The storage media included in a storage node may be of the same capacity, may have the same physical size, and may conform to the same specification, such as, for example, a hard disk drive specification. Example sizes of storage media include, but are not limited to, 2.5″ and 3.5″. Example hard disk drive capacities include, but are not limited to, 1, 2, 3 and 4 terabytes. Example hard disk drive specifications include Serial Attached Small Computer System Interface (SAS), Serial Advanced Technology Attachment (SATA), and others. An example storage node may include 16 three terabyte 3.5″ hard disk drives conforming to the SATA standard. In other configurations, the storage nodes 150 may include more or fewer drives, such as, for example, 10, 12, 24, 32, 40, 48, 64, etc. In other configurations, the storage media 155 in a storage node 150 may be hard disk drives, silicon storage devices, magnetic tape devices, other storage media, or a combination of these. In some embodiments, the physical size of the media in a storage node may differ, and/or the hard disk drive or other storage specification of the media in a storage node may not be uniform among all of the storage devices in a storage node 150.
The storage media 155 in a storage node 150 may be included in a single cabinet, rack, shelf or blade. When the storage media in a storage node are included in a single cabinet, rack, shelf or blade, they may be coupled with a backplane. A controller may be included in the cabinet, rack, shelf or blade with the storage devices. The backplane may be coupled with or include the controller. The controller may communicate with and allow for communications with the storage media according to a storage media specification, such as, for example, a hard disk drive specification. The controller may include a processor, volatile memory and non-volatile memory. The controller may be a single computer chip such as an FPGA, ASIC, PLD or PLA. The controller may include or be coupled with a network interface.
In one embodiment, a controller for a node or a designated node, which may be called a primary node, may handle coordination and management of the storage zone. The coordination and management handled by the controller or primary node includes the distribution and promulgation of storage and replication policies. The controller or primary node may implement the replication processes described herein. The controller or primary node may communicate with a server, such as server 170, and maintain and provide local system health information to the requesting server.
In another embodiment, multiple storage nodes 150 are included in a single cabinet or rack such that a storage zone may be included in a single cabinet. When in a single cabinet or rack, storage nodes and/or constituent storage media may be coupled with a backplane. A controller may be included in the cabinet with the storage media and/or storage nodes. The backplane may be coupled with the controller. The controller may communicate with and allow for communications with the storage media. The controller may include a processor, volatile memory and non-volatile memory. The controller may be a single computer chip such as an FPGA, ASIC, PLD or PLA.
A zone may be constructed in one or more racks, shelves, cabinets and/or other storage units that may be movable or transportable, particularly in the case of movable zones. The movable zone may be included in a single storage unit that may be movable between stationary locations and movable vehicles, watercraft and aircraft. The rack, shelf or cabinet containing a storage zone may include a communications interface that allows for connection to other storage zones, a computing device and/or to a network. The rack, shelf or cabinet containing a storage node 150 may include a communications interface that allows for connection to other storage nodes, a computing device and/or to a network. The communications interface may allow for the transmission of and receipt of information according to one or more of a variety of wired and wireless standards, including, for example, but not limited to, universal serial bus (USB), IEEE 1394 (also known as FIREWIRE® and I.LINK®), Fibre Channel, Ethernet, and WiFi (also known as IEEE 802.11). The backplane or controller in a rack or cabinet containing a storage zone may include a network interface chip, chipset, card or device that allows for communication over a wired and/or wireless network, including Ethernet. The backplane or controller in a rack or cabinet containing one or more storage nodes 150 may include a network interface chip, chipset, card or device that allows for communication over a wired and/or wireless network, including Ethernet. In various embodiments, the storage zone, the storage node, the controller and/or the backplane may provide for and support 1, 2, 4, 8, 12, 16, 32, 48, 64, etc. network connections and may have an equal number of network interfaces to achieve this.
The techniques discussed herein are described with regard to storage media and storage devices including, but not limited to, hard disk drives, magnetic tape, optical discs, and solid-state drives. The techniques may be implemented with other readable and writable optical, magnetic and silicon-based storage media as well as other storage media and devices described herein.
In the data storage system 100, files and other data are stored as objects among multiple storage media 155 in a storage node 150. Files and other data are partitioned into smaller portions referred to as objects. The objects are stored among multiple storage nodes 150 in a storage zone. In one embodiment, each object includes a storage policy identifier and a data portion. The object including its constituent data portion may be stored among storage nodes and storage zones according to the storage policy specified by the storage policy identifier included in the object. Various policies may be maintained and distributed or known to the nodes in all zones in the distributed data storage system. The policies may be stored on and distributed from a client 102 to the data storage system 100 and to all zones in the data storage system and to all nodes in the data storage system. The policies may be stored on and distributed from a server 170 to the data storage system 100 and to all zones in the data storage system and to all nodes in the data storage system. The policies may be stored on and distributed from a primary node or controller in each storage zone in the data storage system.
As used herein, policies specify replication and placement for the object among the storage nodes and storage zones of the data storage system. In other versions of the system, the policies may specify additional features and components. The replication and placement policy defines the replication, encoding and placement of data objects in the data storage system. Example replication and placement policies include full distribution, single copy, single copy to a specific zone, copy to all zones except a specified zone, copy to half of the zones, copy to zones in a certain geographic area, copy to all zones except for zones in certain geographic areas, and others. In addition, the policy may specify that the objects are to be erasure encoded, in which case the data is encoded and stored across multiple storage devices, storage nodes and/or storage zones in the data storage system. A character (e.g., A, B, C, etc.) or number (0, 1, 2, etc.) or combination of one or more characters and numbers (A1, AAA, A2, BC3, etc.) or other scheme may be associated with and used to identify each of the replication, encoding and placement policies. The policy may be stored as a byte or word, where a byte is 8 bits and where a word may be 16, 24, 32, 48, 64, 128, or other number of bits. The policy is included as a policy identifier in an object identifier shown in
Referring again to
The data storage systems described herein may provide for one or multiple kinds of storage replication and data resiliency. The data storage systems described herein may operate as a fully replicated distributed data storage system in which all data is replicated among all storage zones such that all copies of stored data are available from and accessible from all storage zones. This is referred to herein as a fully replicated storage system.
Another configuration of a data storage system provides for partial replication such that data may be replicated in one or more storage zones in addition to an initial storage zone to provide a limited amount of redundancy such that access to data is possible when a zone goes down or is impaired or unreachable, without the need for full replication. The partial replication configuration does not require that each zone have a full copy of all data objects.
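By way of illustration only, the mapping from a policy identifier to the set of zones that should hold a copy of an object may be sketched as follows. The single-letter identifiers, the function name target_zones, the parameter names and the zone names are all hypothetical and chosen for this sketch; the disclosure states only that a character, number or combination may identify each policy.

```python
from typing import List, Optional, Set


def target_zones(policy_id: str, all_zones: List[str], origin: str,
                 param: Optional[str] = None) -> Set[str]:
    """Return the zones that should hold a copy under a given policy.

    Hypothetical identifiers used in this sketch:
      "A" full distribution (every zone)
      "B" single copy, kept in the zone of origin
      "C" single copy to a specific zone named by `param`
      "D" copy to all zones except the zone named by `param`
    """
    zones = set(all_zones)
    if policy_id == "A":
        return zones
    if policy_id == "B":
        return {origin}
    if policy_id in ("C", "D"):
        if param not in zones:
            raise ValueError("policy needs a known zone as its parameter")
        return {param} if policy_id == "C" else zones - {param}
    raise ValueError(f"unknown policy identifier: {policy_id!r}")


if __name__ == "__main__":
    zones = ["z112", "z114", "z160"]
    # Partial replication: every zone except movable zone z160.
    print(sorted(target_zones("D", zones, origin="z112", param="z160")))
```

Full distribution corresponds to the fully replicated configuration, while the other identifiers correspond to forms of partial replication.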
Replication may be performed synchronously, that is, completed before the write operation is acknowledged; asynchronously, that is, the replicas may be written before, after or during the write of the first copy; or a combination of each. During data ingest, synchronous replication provides for a high level of data resiliency while asynchronous replication provides for resiliency at a lower level. As described herein, replication may be synchronous and/or asynchronous while all zones are connected to the data storage system. When a movable zone is disconnected from the system, the remaining stationary and connected movable zones may operate in a synchronous manner, but the overall system operates in an asynchronous manner as the movable disconnected zone is not connected to the data storage system.
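The distinction between synchronous and asynchronous replication may be sketched as follows. This is a minimal in-memory illustration, not the disclosed implementation: the Zone and Replicator classes, the pending queue and the drain method are assumptions introduced for the example.

```python
from collections import deque


class Zone:
    """Hypothetical in-memory stand-in for a storage zone."""

    def __init__(self, name):
        self.name = name
        self.objects = {}  # maps OID -> data


class Replicator:
    def __init__(self, sync_zones, async_zones):
        self.sync_zones = list(sync_zones)
        self.async_zones = list(async_zones)
        self.pending = deque()  # replicas still to be written

    def write(self, oid, data):
        # Synchronous replicas are completed before the acknowledgement.
        for zone in self.sync_zones:
            zone.objects[oid] = data
        # Asynchronous replicas are only queued here; they are written
        # later, for example when a disconnected zone is reachable again.
        for zone in self.async_zones:
            self.pending.append((zone, oid, data))
        return True  # acknowledgement of the write

    def drain(self):
        """Write all queued asynchronous replicas."""
        while self.pending:
            zone, oid, data = self.pending.popleft()
            zone.objects[oid] = data


if __name__ == "__main__":
    stationary, movable = Zone("z112"), Zone("z160")
    rep = Replicator(sync_zones=[stationary], async_zones=[movable])
    rep.write("oid-1", b"payload")
    print("oid-1" in stationary.objects, "oid-1" in movable.objects)
    rep.drain()
    print("oid-1" in movable.objects)
```

In this sketch a disconnected movable zone would be treated as an asynchronous target, which mirrors the observation that the overall system operates asynchronously while a movable zone is away.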
To facilitate the management and replication of objects in the data storage system, an object database on the server 170 may store information about each object. The object database may be indexed according to the object identifiers (OIDs) of the objects. The object database may be an SQLITE® database. In other embodiments the database may be a MONGODB®, Voldemort, or other key-value store.
The objects and the object database may be referenced by object identifier or OIDs like those shown and described regarding
In one version of the system, the location identifier 302 is 30 bits, but may be other sizes in other implementations, such as, for example, 24 bits, 32 bits, 48 bits, 64 bits, 128 bits, 256 bits, 512 bits, etc. In one version of the system, the location identifier 302 includes both a group identifier (“group ID”) and an index. The group ID may represent a collection of objects stored under the same policy, and having the same searchable metadata fields. The group ID of the object becomes a reference for the embedded database of the object group. The group ID may be used to map the object to a particular storage node or storage device, such as a hard disk drive. The mapping may be stored in a mapping table maintained by the object storage system. The mapping information is distributed and is hierarchical. More specifically, the system stores a portion of mapping information in memory, and the storage nodes hold a portion of the mapping information in their memory. Master copies of the mapping information are kept on disk or other nonvolatile storage medium on the storage nodes. The master copies of the mapping information are dynamically updated to be consistent with any changes made while the system is active. The index may be the specific location of the object within the group. The index may refer to a specific location on disk or other storage device.
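As a non-limiting sketch, a 30-bit location identifier may be divided into a group ID and an index, and the group ID may be looked up in a mapping table. The 14-bit and 16-bit field widths, the function names and the table contents below are assumptions made for the example; the disclosure does not specify how the 30 bits are divided.

```python
GROUP_BITS = 14  # assumed width of the group ID
INDEX_BITS = 16  # assumed width of the index; 14 + 16 = 30 bits total


def make_location(group_id: int, index: int) -> int:
    """Combine a group ID and an index into one location identifier."""
    if not 0 <= group_id < (1 << GROUP_BITS):
        raise ValueError("group_id out of range")
    if not 0 <= index < (1 << INDEX_BITS):
        raise ValueError("index out of range")
    return (group_id << INDEX_BITS) | index


def split_location(location: int):
    """Recover (group_id, index) from a location identifier."""
    return location >> INDEX_BITS, location & ((1 << INDEX_BITS) - 1)


# Hypothetical mapping table from group ID to a storage node name.
group_to_node = {7: "node-150-a"}


def locate(location: int):
    """Return the storage node and the index within the group."""
    group_id, index = split_location(location)
    return group_to_node[group_id], index


if __name__ == "__main__":
    loc = make_location(7, 42)
    print(locate(loc))
```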
The unique identifier 304 is a unique number or alphanumeric sequence that is used to identify the object in the storage system. The unique identifier 304 may be randomly generated, may be the result of a hash function of the object itself (that is, the data or data portion), may be the result of a hash function on the metadata of the object, or may be created using another technique. In one embodiment, the unique identifier is assigned by the controller in such a manner that the storage device is used efficiently. The unique identifier 304 may be stored as 24 bits, 32 bits, 64 bits, 128 bits, 256 bits, 512 bits, 1 kilobyte, etc.
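Two of the listed ways of producing a unique identifier, hashing the data portion and random generation, may be sketched as follows. The choice of SHA-256, the 64-bit default width and the function names are assumptions for illustration; the disclosure does not name a particular hash function.

```python
import hashlib
import secrets


def uid_from_data(data: bytes, bits: int = 64) -> int:
    """Derive an identifier from a hash of the object's data portion."""
    if not 1 <= bits <= 256:
        raise ValueError("bits must be between 1 and 256")
    digest = hashlib.sha256(data).digest()  # 256-bit digest
    # Keep only the most significant `bits` bits of the digest.
    return int.from_bytes(digest, "big") >> (256 - bits)


def uid_random(bits: int = 64) -> int:
    """Generate an identifier at random."""
    return secrets.randbits(bits)


if __name__ == "__main__":
    print(hex(uid_from_data(b"example object data")))
    print(hex(uid_random()))
```

A hash-derived identifier is repeatable for identical data, whereas a random identifier is not; truncating a digest to fewer bits raises the chance of two different objects receiving the same identifier.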
The object identifier 300 may optionally include flags 306. Flags 306 may be used to distinguish between different object types by providing additional characteristics or features of the object. The flags may be used by the data storage system to evaluate whether to retrieve or delete objects. In one embodiment, the flags associated with the object indicate if the object is to be preserved for specific periods of time, or to authenticate the client to ensure that there is sufficient permission to access the object. In one version of the system, the flags 306 portion of the OID 300 is 8 bits, but may be other sizes in other implementations, such as, for example, 16 bits, 32 bits, 48 bits, 64 bits, 128 bits, 256 bits, 512 bits, etc.
The policy identifier 308 is described above in para. [0032].
The total size of the object identifier may be, for example, 128 bits, 256 bits, 512 bits, 1 kilobyte, 4 kilobytes, etc. In one embodiment, the total size of the object identifier includes the sum of the sizes of the location identifier, unique identifier, flags, policy identifier, and version identifier. In other embodiments, the object identifier includes additional data that is used to obfuscate the true contents of the object identifier. In other embodiments, other kinds and formats of OIDs may be used.
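One possible 128-bit layout of an object identifier may be sketched as follows. Only the 30-bit location identifier and the 8-bit flags come from the description above; the 64-bit unique identifier, the 8-bit policy identifier, the 18-bit version identifier, the field order and all names are assumptions chosen so that the widths sum to 128 bits.

```python
from typing import NamedTuple


class OID(NamedTuple):
    location: int
    unique: int
    flags: int
    policy: int
    version: int


# (field name, width in bits); the widths sum to 128 in this sketch.
FIELDS = (("location", 30), ("unique", 64), ("flags", 8),
          ("policy", 8), ("version", 18))


def pack(oid: OID) -> int:
    """Pack the fields into one integer, first field most significant."""
    value = 0
    for name, width in FIELDS:
        field = getattr(oid, name)
        if not 0 <= field < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        value = (value << width) | field
    return value


def unpack(value: int) -> OID:
    """Recover the fields by peeling them off from the least significant end."""
    parts = {}
    for name, width in reversed(FIELDS):
        parts[name] = value & ((1 << width) - 1)
        value >>= width
    return OID(**parts)


if __name__ == "__main__":
    oid = OID(location=5, unique=123456789, flags=0b1, policy=3, version=1)
    packed = pack(oid)
    print(packed.to_bytes(16, "big").hex())
    print(unpack(packed) == oid)
```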
In some embodiments, when the data objects are large, the data object may be partitioned into sub-objects. The flags 306 may be useful in the handling of large data objects and their constituent sub-objects. Similarly, the group ID may be included as part of the location identifier 302, and may be used in mapping and reassembling the constituent parts of large data objects.
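Partitioning a large data object into sub-objects and reassembling it may be sketched as follows. The fixed-size chunking, the sequence numbers and the function names are assumptions for illustration; the disclosure does not state how sub-objects are formed or ordered.

```python
from typing import Iterable, List, Tuple


def partition(data: bytes, max_size: int) -> List[Tuple[int, bytes]]:
    """Split data into (sequence number, chunk) pairs of at most max_size bytes."""
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    return [(i // max_size, data[i:i + max_size])
            for i in range(0, len(data), max_size)]


def reassemble(parts: Iterable[Tuple[int, bytes]]) -> bytes:
    """Rebuild the original data from sub-objects received in any order."""
    return b"".join(chunk for _, chunk in sorted(parts))


if __name__ == "__main__":
    data = bytes(range(10))
    parts = partition(data, 4)
    print(len(parts))
    print(reassemble(reversed(parts)) == data)
```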
Processes
The methods described herein accommodate movable zones that are disconnected from the network that connects the stationary zones. In this way, the methods describe how a disjointed data storage system manages movable zones. In practice, reconnaissance aircraft (for example, airplanes, blimps, and unmanned aerial vehicles), ocean exploratory vessels (for example, ships and submarines), spacecraft (for example, satellites and space ships), mobile command centers, and the like may be disconnected from a primary network and the data storage system but reconnect regularly or occasionally. When the movable zones reconnect, the data captured and stored on the nodes in the movable zone are stored on and distributed among the stationary zones according to the particular policies for the objects stored on the movable zone. In one configuration, the objects originating from movable zones may all be members of the same object group. In other configurations the objects stored on a movable zone may be members of one or multiple object groups, and it is the groups that specify the storage and distribution requirements of the objects. The distribution of the objects from a movable zone may be determined by the object group and/or policy identifier for the particular objects.
Referring now to
Referring now to
Next, depending on the policy and/or group specified in the OID of objects copied from the movable zone to the stationary zone, objects are replicated through the storage system, as shown in block 550. This includes copying the object to other zones in the cluster to which the movable zone is currently connected as well as copying the object to other zones in other clusters in the data storage system based on the policy and/or group specified in the OID of the objects originating from the movable zone. This allows for replication of objects in the data storage system according to the policies and group information for objects stored on the movable zone.
Further, the stationary zone evaluates objects stored on the stationary zone and in the cluster in view of policies and group information and copies or transfers objects from the stationary zone to the movable zone based on the policies and group information of objects stored on the stationary zone, as shown in block 560. In this situation, in practice, objects that may have been created and stored in the stationary zone/cluster when the movable zone was disconnected are identified and copied or transferred to the movable zone. This allows for replication of objects in the data storage system according to the policies and group information for objects stored on the stationary zone throughout the data storage system.
In various configurations, the actions in blocks 530, 540, 550 and 560 may be performed concurrently, sequentially, in an overlapping manner, and/or in any order.
The movable zone while connected with the stationary zone/cluster functions as a stationary zone until it loses connectivity with the cluster, as shown in block 570. When the movable zone loses connectivity with the cluster, it functions as a stand-alone zone, as shown in block 580. When functioning as a stand-alone zone, the movable zone cannot fully achieve the distribution requirements of the groups or policies for the objects it stores. The movable zone delays action on fulfilling the zone and/or group requirements until the movable zone regains connectivity with other zones or clusters in the data storage system. The flow of actions continues with the movable zone connecting to a stationary zone or cluster, as shown in block 510.
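The cycle of stand-alone operation, reconnection and two-way replication described above may be sketched as follows. This is a simplified in-memory model under stated assumptions: the classes, the deferred list, the representation of a policy as either the string "ALL" or a set of zone names, and the function names are all hypothetical, and the sketch copies every deferred object to the stationary zone before applying the policy.

```python
class Zone:
    """Hypothetical in-memory storage zone."""

    def __init__(self, name):
        self.name = name
        self.objects = {}  # maps OID -> (policy, data)


class MovableZone(Zone):
    def __init__(self, name):
        super().__init__(name)
        self.connected = False
        self.deferred = []  # OIDs whose replication is delayed

    def store(self, oid, policy, data):
        self.objects[oid] = (policy, data)
        if not self.connected:
            # Stand-alone operation (block 580): replication is delayed.
            self.deferred.append(oid)

    def disconnect(self):
        self.connected = False


def applies(policy, zone_name):
    """Assumed policy form: "ALL" or a collection of zone names."""
    return policy == "ALL" or zone_name in policy


def reconnect(movable, stationary, other_zones):
    movable.connected = True  # connection established (block 510)
    # Copy objects captured while disconnected to the stationary zone.
    for oid in movable.deferred:
        stationary.objects[oid] = movable.objects[oid]
    # Replicate through the storage system per policy (block 550).
    for oid in movable.deferred:
        policy, data = movable.objects[oid]
        for zone in other_zones:
            if applies(policy, zone.name):
                zone.objects[oid] = (policy, data)
    movable.deferred.clear()
    # Copy objects created elsewhere to the movable zone (block 560).
    for oid, (policy, data) in list(stationary.objects.items()):
        if oid not in movable.objects and applies(policy, movable.name):
            movable.objects[oid] = (policy, data)


if __name__ == "__main__":
    ship = MovableZone("z160")
    port, inland = Zone("z112"), Zone("z114")
    port.objects["o1"] = ("ALL", b"created ashore")
    ship.store("o2", "ALL", b"captured at sea")
    reconnect(ship, port, [inland])
    print(sorted(ship.objects), sorted(port.objects), sorted(inland.objects))
```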
The methods described regarding
Closing Comments
Throughout this description, the embodiments and examples shown should be considered as exemplars, rather than limitations on the apparatus and procedures disclosed or claimed. Although many of the examples presented herein involve specific combinations of method acts or system elements, it should be understood that those acts and those elements may be combined in other ways to accomplish the same objectives. With regard to flowcharts, additional and fewer steps may be taken, and the steps as shown may be combined or further refined to achieve the methods described herein. Acts, elements and features discussed only in connection with one embodiment are not intended to be excluded from a similar role in other embodiments.
As used herein, “plurality” means two or more.
As used herein, a “set” of items may include one or more of such items.
As used herein, whether in the written description or the claims, the terms “comprising”, “including”, “carrying”, “having”, “containing”, “involving”, and the like are to be understood to be open-ended, i.e., to mean including but not limited to. Only the transitional phrases “consisting of” and “consisting essentially of”, respectively, are closed or semi-closed transitional phrases with respect to claims.
Use of ordinal terms such as “first”, “second”, “third”, etc., “primary”, “secondary”, “tertiary”, etc. in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another or the temporal order in which acts of a method are performed, but are used merely as labels to distinguish one claim element having a certain name from another element having a same name (but for use of the ordinal term) to distinguish the claim elements.
As used herein, “and/or” means that the listed items are alternatives, but the alternatives also include any combination of the listed items.