Various embodiments of the present disclosure are generally directed to an apparatus and method for updating map structures in an object storage system, such as a cloud computing environment.
In accordance with some embodiments, a proxy server communicates with users of the object storage system over a computer network. A plurality of data storage devices have associated memory to store and retrieve data objects of the users. The data storage devices are arranged into a plurality of locations within the object storage system. A storage controller is associated with each location to direct data transfers between the data storage devices of the associated location and the proxy server using an existing map structure that describes the data objects in each location. A management module is adapted to generate a new map structure, migrate at least one data object from a first location described by the existing map structure to a second location described by the new map structure, and to distribute the new map structure to each of the storage controllers after the migration of the at least one data object.
The present disclosure generally relates to the migration of data in an object storage system, such as in a cloud computing environment.
Cloud computing generally refers to a network-based distributed data processing environment. Network services such as computational resources, software and/or data are made available to remote users via a wide area network, such as but not limited to the Internet. A cloud computing network can be a public “available-by-subscription” service accessible by substantially any user for a fee, or a private “in-house” service operated by or for the use of one or more dedicated users.
A cloud computing network is generally arranged as an object storage system whereby data objects (e.g., files) from users (“account holders” or simply “accounts”) are replicated and stored in storage locations within the system. Depending on the system configuration, the locations may be geographically distributed. The network may be accessed through web-based tools such as web browsers, and provides services to a user as if such services were installed locally on the user's local computer. Other tools can be used including command line tools, etc.
Object storage systems are often configured to be massively scalable so that new storage nodes, servers, software modules, etc. can be added to the system to expand overall capabilities in a manner transparent to the user. An object storage system can continuously carry out significant amounts of background overhead processing to store, replicate, migrate and rebalance the data objects stored within the system in an effort to ensure the data objects are available to the users at all times.
Various embodiments of the present disclosure are generally directed to advancements in the manner in which an object storage system deploys updated mapping within the system. As explained below, in some embodiments a server is adapted to communicate with users of the object storage system over a network. A plurality of data storage devices store and retrieve data objects from the users. The data storage devices are arranged into a plurality of locations (e.g., zones) each corresponding to a different physical location within the distributed object storage system and having an associated storage controller. Map structures are used to associate storage entities such as the data objects with physical locations within the data storage devices.
A map management module is adapted to generate a new map structure, migrate at least one data object from a first storage location described by the existing map structure to a second storage location described by the new map structure, and then deploy (distribute) the new map structure to each of the storage controllers. Thereafter, the storage controllers use the new map structure to direct data object transfer operations between the respective locations and the server.
In some cases, the map structures are referred to as rings, and the rings are arranged as an account ring, a container ring and an object ring. The account ring provides lists of containers, or groups of data objects owned by a particular user (“account”). The container ring provides lists of data objects in each container, and the object ring provides lists of data objects mapped to their particular storage locations. Other forms of map structures can be used.
By migrating the data prior to deployment of the new map structures, the system will be substantially up to date at the time of deployment and the data objects in the system will nominally match the new map structures. This can reduce system disturbances by substantially eliminating the need for system services to react and perform large scale data migrations to conform the system to the newly deployed maps. In this way, the data objects can be quickly and efficiently migrated to the new mapping in the background without substantively affecting user data access operations or overhead processing within the system.
These and various other features of various embodiments disclosed herein can be understood beginning with a review of
The system 100 is accessed by one or more user devices 102, which may take the form of a network accessible device such as a desktop computer, a terminal, a laptop, a tablet, a smartphone, a game console or other device with network connectivity capabilities. In some cases, each user device 102 accesses the system 100 via a web-based application on the user device that communicates with the system 100 over a network 104. The network 104 may take the form of the Internet or some other computer-based network.
The system 100 includes various elements that may be geographically distributed over a large area. These elements include one or more management servers 106 which process communications with the user devices 102 and perform other system functions. A plurality of storage controllers 108 control local groups of storage devices 110 used to store data objects from the user devices 102 as requested, and to return the data objects as requested. Each grouping of storage devices 110 and associated controller 108 is characterized as a storage node 112.
While only three storage nodes 112 are illustrated in
Generally, data presented to the system 100 by the users of the system are organized as data objects, each constituting a cohesive associated data set (e.g., a file) having an object identifier (e.g., a “name”). Examples include databases, word processing and other application files, graphics, A/V works, web pages, games, executable programs, etc. Substantially any type of data object can be stored depending on the parametric configuration of the system.
Each data object presented to the system 100 will be subjected to a system replication policy so that multiple copies of the data object are stored in different zones. It is contemplated albeit not required that the system nominally generates and stores three (3) replicas of each data object. This enhances data reliability, but generally increases background overhead processing to maintain the system in an updated state.
An example hardware architecture for portions of the system 100 is represented in
The storage rack 118 is a 42 U server cabinet with 42 units (U) of storage, with each unit extending about 1.75 inches (in) of height. The width and length dimensions of the cabinet can vary but common values may be on the order of about 24 in.×36 in. Each storage enclosure 120 can have a height that is a multiple of the storage units, such as 2 U (3.5 in.), 3 U (5.25 in.), etc.
In some cases, the functionality of the storage controller 108 can be carried out using the local computer 116. In other cases, the storage controller functionality is carried out by processing capabilities of one or more of the storage enclosures 120, and the computer 116 can be eliminated or used for other purposes such as local administrative personnel access. In one embodiment, each storage node 112 from
An example configuration for a selected storage enclosure 120 is shown in
In the context of an HDD, the storage media may take the form of one or more axially aligned magnetic recording discs which are rotated at high speed by a spindle motor. Data transducers can be arranged to be controllably moved and hydrodynamically supported adjacent recording surfaces of the storage disc(s). While not limiting, in some embodiments the storage devices 122 are 3½ inch form factor HDDs with nominal dimensions of 5.75 in×4 in×1 in.
In the context of an SSD, the storage media may take the form of one or more flash memory arrays made up of non-volatile flash memory cells. Read/write/erase circuitry can be incorporated into the storage media module to effect data recording, read back and erasure operations. Other forms of solid state memory can be used in the storage media including magnetic random access memory (MRAM), resistive random access memory (RRAM), spin torque transfer random access memory (STRAM), phase change memory (PCM), in-place field programmable gate arrays (FPGAs), electrically erasable electrically programmable read only memories (EEPROMs), etc.
In the context of a hybrid (SDHD) device, the storage media may take multiple forms such as one or more rotatable recording discs and one or more modules of solid state non-volatile memory (e.g., flash memory, etc.). Other configurations for the storage devices 122 are readily contemplated, including other forms of processing devices besides devices primarily characterized as data storage devices, such as computational devices, circuit cards, etc. that at least include computer memory to accept data objects or other system data.
The storage enclosures 120 include various additional components such as power supplies 124, a control board 126 with programmable controller (CPU) 128, fans 130, etc. to enable the data storage devices 122 to store and retrieve user data objects.
An example software architecture of the system 100 is represented by
The proxy server 136 accesses a plurality of map structures, or rings, to control data flow to the respective data storage devices 112 (
The account ring 140 provides lists of containers, or groups of data objects owned by a particular user (“account”). The container ring 142 provides lists of data objects in each container, and the object ring 144 provides lists of data objects mapped to their particular storage locations.
Each ring 140, 142, 144 has an associated set of services 150, 152, 154 and storage 160, 162, 164. The storage may or may not be on the same devices. The services and storage enable the respective rings to maintain mapping using zones, devices, partitions and replicas. The services may be realized by software, hardware and/or firmware. In some cases, the services are software modules representing programming executed by an associated processor of the system.
As discussed previously, a zone is a physical set of storage isolated to some degree from other zones with regard to disruptive events. A given pair of zones can be physically proximate one another, provided that the zones are configured to have different power circuit inputs, uninterruptible power supplies, or other isolation mechanisms to enhance survivability of one zone if a disruptive event affects the other zone. Conversely, a given pair of zones can be geographically separated so as to be located in different facilities, different cities, different states and/or different countries.
Devices refer to the physical devices in each zone. Partitions represent a complete set of data (e.g., data objects, account databases and/or container databases) and serve as an intermediate “bucket” that facilitates management of the locations of the data objects within the cluster. Data may be replicated at the partition level so that each partition is stored three times, once in each zone. The rings further determine which storage devices are used to service a particular data access operation and which devices should be used in failure handoff scenarios.
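By way of illustration and not limitation, the zone-level replication just described can be checked with a short routine. The following is a minimal sketch only; the function names, the `assignments` nested list (one inner list per replica, indexed by partition number) and the `dev_zone` lookup are hypothetical and are not taken from the disclosure.

```python
def zones_for_partition(assignments, dev_zone, part):
    """Return the zone holding each replica of one partition.

    assignments: hypothetical nested list, one inner list per replica,
                 where assignments[r][part] is a device ID.
    dev_zone:    hypothetical mapping of device ID -> zone number.
    """
    return [dev_zone[replica_row[part]] for replica_row in assignments]


def is_zone_isolated(assignments, dev_zone, part):
    """True when every replica of the partition sits in a different zone."""
    zones = zones_for_partition(assignments, dev_zone, part)
    return len(set(zones)) == len(zones)
```

A partition for which `is_zone_isolated` returns False would have two replicas exposed to the same disruptive event, which is the condition the zone arrangement is intended to avoid.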
In at least some cases, the object services block 154 can include an object server arranged as a relatively straightforward blob server configured to store, retrieve and delete objects stored on local storage devices. The objects are stored as binary files on an associated file system. Metadata may be stored as file extended attributes (xattrs). Each object is stored using a path derived from a hash of the object name and an operational timestamp. Last written data always “wins” in a conflict and helps to ensure that the latest object version is returned responsive to a user or system request. Deleted objects are treated as a 0 byte file ending with the extension “.ts” for “tombstone.” This helps to ensure that deleted files are replicated correctly and older versions do not inadvertently reappear in a failure scenario.
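The naming and conflict rules above can be illustrated with a minimal sketch. It assumes, purely for illustration, that SHA-256 is the selected hash, that files are named `<timestamp>.<extension>`, and that live objects use a “.data” extension; only the “.ts” tombstone extension comes from the description above, and all function names are hypothetical.

```python
import hashlib


def object_dir(account, container, obj):
    """Directory name derived from a hash of the full object name.

    SHA-256 is an assumption; the disclosure does not name the hash.
    """
    name = f"/{account}/{container}/{obj}"
    return hashlib.sha256(name.encode("utf-8")).hexdigest()


def newest_file(filenames):
    """Pick the file with the latest timestamp, so the last write wins."""
    return max(filenames, key=lambda f: float(f.rsplit(".", 1)[0]))


def is_deleted(filenames):
    """An object is treated as deleted when its newest file is a tombstone."""
    return newest_file(filenames).endswith(".ts")
```

Because the tombstone carries its own timestamp, a stale replica holding only an older data file loses the comparison, so the older version does not reappear after a failure.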
The container services block 152 can include a container server which processes listings of objects in respective containers without regard to the physical locations of such objects. The listings may be stored as SQLite database files or in some other form, and are replicated across a cluster similar to the manner in which objects are replicated. The container server may also track statistics with regard to the total number of objects and total storage usage for each container.
The account services block 150 may incorporate an account server that functions in a manner similar to the container server, except that the account server maintains listings of containers rather than objects. To access a particular data object, the account ring 140 may be consulted to identify the associated container(s) for the account, the container ring 142 may be consulted to identify the associated data object(s), and the object ring 144 may be consulted to locate the various copies in physical storage. Alternatively, as discussed above the account and container identifications may be supplied as arguments and the object is identified directly. Regardless, the user input specifies one or more data objects, and commands are issued to the appropriate storage node 112 (
Additional services 172 incorporated by or used in conjunction with the account, container and object services 150, 152, 154 are represented in
The system services 172 can include replicators 174, updaters 176, auditors 178 and a ring management module 180. Generally, the replicators 174 attempt to maintain the system in a consistent state by comparing local data with each remote copy to ensure all are at the latest version. Object replication can use a hash list to quickly compare subsections of each partition, and container and account replication can use a combination of hashes and other data as desired.
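The hash-list comparison used for object replication can be sketched as follows. This is an illustrative assumption rather than the disclosed implementation: each partition is modeled as a hypothetical dictionary mapping a subsection name to the file names it holds, and SHA-256 stands in for whatever hash the system actually uses.

```python
import hashlib


def subsection_hashes(partition):
    """Map each subsection name to a digest of its sorted file names."""
    hashes = {}
    for subsection, files in partition.items():
        joined = "\n".join(sorted(files)).encode("utf-8")
        hashes[subsection] = hashlib.sha256(joined).hexdigest()
    return hashes


def out_of_sync(local, remote):
    """Return the subsections whose hashes differ between two copies."""
    local_h = subsection_hashes(local)
    remote_h = subsection_hashes(remote)
    names = set(local_h) | set(remote_h)
    return sorted(s for s in names if local_h.get(s) != remote_h.get(s))
```

Only the subsections returned by `out_of_sync` need to be examined file by file, which is what allows the comparison to proceed quickly.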
The updaters 176 attempt to correct out of sync issues due to failure conditions or periods of high loading when updates cannot be timely serviced. The auditors 178 crawl the local system checking the integrity of objects, containers and accounts. If an error is detected with a particular entity, the entity is quarantined and other services are called to rectify the situation.
The ring management module 180 operates to process updates to the map (ring) structures.
The map data structure 182 is shown to include three primary elements: a list of devices 184, a partition assignment list 186 and a partition shift hash 188. The list of devices (devs) 184 lists all data storage devices 122 that are associated with, or that are otherwise accessible by, the associated ring, as shown in Table 1.
Generally, ID provides an index of the devices list by device identification (ID) value. ZONE indicates the zone in which the data storage device is located. WEIGHT indicates a relative weight factor of the storage capacity of the device relative to other storage devices in the system. For example, a 2 TB (terabyte, 10^12 bytes) drive may be given a weight factor of 2.0, a 4 TB drive may be given a weight factor of 4.0, and so on.
IP ADDRESS is the IP address of the storage controller associated with the device. TCP PORT identifies the TCP port the storage controller uses to serve requests for the device. DEVICE is the name of the device within the host system, and is used to identify the disk mount point. METADATA is a general use field that can be used to store various types of arbitrary information as needed.
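One way to picture an entry in the list of devices 184 is as a small record carrying the fields just described. The sketch below is illustrative only; the class name, attribute names, and the `capacity_share` helper are hypothetical, and the helper simply shows how relative weights could translate into a proportional share of the system.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """One entry of the device list (attribute names are hypothetical)."""
    id: int          # ID: index into the device list
    zone: int        # ZONE: zone in which the device is located
    weight: float    # WEIGHT: relative capacity, e.g. 2.0 for a 2 TB drive
    ip: str          # IP ADDRESS of the associated storage controller
    port: int        # TCP PORT the controller uses for this device
    device: str      # DEVICE: name within the host, e.g. a mount point
    meta: dict = field(default_factory=dict)  # METADATA: general use


def capacity_share(devs):
    """Fraction of the total weight contributed by each device."""
    total = sum(d.weight for d in devs)
    return {d.id: d.weight / total for d in devs}
```

Under this assumption, a 4 TB drive with a weight of 4.0 alongside two 2 TB drives with weights of 2.0 would account for half of the total weight.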
The partition assignment list 186 generally maps partitions to the individual devices. This data structure is a nested list: N lists of M+2 elements, where N is the number of replicas for each of M partitions. In some cases, the list 186 may be arranged to list the device ID for the first replica of each of the M partitions in the first list, the device ID for the second replica of each of the M partitions in the second list, and so on. The number of replicas N is established by the system administrators and may be set to three (e.g., N=3) or some other value. The number of partitions M is also established by the system administrators and may be a selected power of two (e.g., M=2^20, etc.).
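The nested arrangement of the partition assignment list can be illustrated with a toy example. The sketch assumes, for illustration only, that each inner list is indexed directly by partition number and holds one device ID per partition; the function name and the sample values are hypothetical, and a real system would use a far larger partition count.

```python
def devices_for_partition(assignments, part):
    """Return the device ID holding each replica of the given partition.

    assignments[r][part] is the device ID for replica r of partition
    `part` (an assumed layout for this sketch).
    """
    return [replica_row[part] for replica_row in assignments]


# Toy ring with N = 3 replicas and M = 4 partitions (values hypothetical).
example_assignments = [
    [0, 1, 2, 0],  # first replica of partitions 0..3
    [3, 4, 5, 3],  # second replica of partitions 0..3
    [6, 7, 8, 6],  # third replica of partitions 0..3
]
```

Calling `devices_for_partition(example_assignments, 1)` yields the three device IDs that together hold partition 1.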
The partition shift value 188 is a number of bits taken from a selected hash of the “account/container/object” path to provide a partition index for the path. The partition index may be calculated by translating a binary portion of the hash value into an integer number.
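A minimal sketch of the partition index calculation follows. It rests on several assumptions not stated in the disclosure: SHA-256 is used as the selected hash, the first 32 bits of the digest are the binary portion that is translated to an integer, and the shift value is the number of low-order bits discarded so that `part_power` bits remain. The names `partition_index` and `part_power` are hypothetical.

```python
import hashlib


def partition_index(account, container, obj, part_power):
    """Map an account/container/object path to a partition index.

    part_power is the assumed base-2 logarithm of the partition count,
    so the result lies in the range 0 .. 2**part_power - 1.
    """
    path = f"/{account}/{container}/{obj}".encode("utf-8")
    digest = hashlib.sha256(path).digest()
    top32 = int.from_bytes(digest[:4], "big")  # binary portion -> integer
    part_shift = 32 - part_power               # bits discarded by the shift
    return top32 >> part_shift
```

Because the index depends only on the path and the shift value, every server holding the same map structure computes the same partition for a given object without consulting a central directory.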
The access command is forwarded to the associated storage node, and the local storage controller 108 schedules a read operation upon the storage memory (mem) 122A of the associated storage device(s) 122. In some cases, system services may determine which replicated set of the data should be accessed to return the data. The data objects (retrieved data) are returned from the associated device and forwarded to the proxy server which in turn forwards the requested data to the user device 138.
It will be noted that the data are migrated prior to the deployment (dissemination or promulgation) of the new map structures so that when the new map structures are promulgated to the various servers, the data objects will already be stored in locations that substantially conform to the newly promulgated maps. In this way, other resources of the system such as set forth in
As shown in
The map builder module 190 proceeds to generate a new map structure (“new mapping”), and supplies such to the data migration module 192. The map builder module 190 may further supply the existing mapping to the data migration module 192. As further depicted in
A migration sequencing block 196 schedules and directs the migration of data objects within the storage nodes 112 to conform the data storage state to the new mapping. This may include the issuance of various data migration commands to the respective storage nodes 112 in the system, as represented in
In response to the data migration commands, various data objects may be read, temporarily stored, and rewritten to different ones of the various storage devices 122 in the storage nodes 112. It is contemplated that at least some of the migrated data objects will be migrated from one zone to the next. The migration sequencing block 196 may receive command complete status indications from the nodes signifying the status of the ongoing data migration effort.
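The comparison of the existing and new mappings, and the resulting list of migration commands, can be sketched as follows. This is an illustrative assumption only: both mappings are modeled as nested lists in which `mapping[r][p]` is the device ID for replica r of partition p, and the function name and tuple layout are hypothetical.

```python
def plan_migrations(existing, new):
    """List the moves needed to conform stored data to the new mapping.

    Each move is a tuple (replica, partition, source_dev, target_dev).
    Entries that are identical in both mappings produce no move.
    """
    moves = []
    for replica, (old_row, new_row) in enumerate(zip(existing, new)):
        for part, (src, dst) in enumerate(zip(old_row, new_row)):
            if src != dst:
                moves.append((replica, part, src, dst))
    return moves
```

Each tuple corresponds to one read-and-rewrite operation of the kind described above, and an empty result indicates that the stored data already conform to the new mapping.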
It will be noted that while the data are being migrated, the data state will be intentionally placed in a condition where it deviates from the existing map structures of the respective nodes. In some cases, a transition management block 198 may communicate with other services of the system (e.g., the replicators 174, updaters 176, auditors 178, etc. of
Once the data migration is complete, a migration complete status may be generated by the data migration module 192 and forwarded to the map builder module 190 as indicated in
Another issue that may arise from this processing is the handling of data access commands as in
In such case, the migration sequencing block 196 may be configured to intelligently select the order in which data objects are migrated and/or tombstoned during the migration process. As the system maintains multiple replicas of every set of data objects (e.g., N=3), in some cases at least one set of data objects is maintained in an existing mapping structure so that data access commands can be issued to those replica sets of the data objects not affected by the migration. In some cases, these pristine replicas can be denoted as “source” replicas so that any access commands received during the data migration process are serviced from these replicas.
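One possible sequencing policy consistent with the above is sketched below. It is an assumption made for illustration, not the disclosed algorithm: a replica whose device is unchanged between the mappings is preferred as the source replica, and if every replica must move, the last replica is held back until the others have been migrated. The mappings are modeled as nested lists indexed as `mapping[replica][partition]`, and all names are hypothetical.

```python
def pick_source(existing, new, part):
    """Choose the replica that keeps serving reads during migration."""
    for replica in range(len(existing)):
        if existing[replica][part] == new[replica][part]:
            return replica  # untouched by the migration
    return len(existing) - 1  # all replicas move: hold the last one back


def migration_order(existing, new, part):
    """Order in which replicas of one partition are moved, source last."""
    source = pick_source(existing, new, part)
    moving = [r for r in range(len(existing))
              if existing[r][part] != new[r][part]]
    ordered = [r for r in moving if r != source]
    if source in moving:
        ordered.append(source)
    return ordered
```

Under this policy at least one replica remains reachable through the existing mapping until the other replicas of the partition have been migrated.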
Additionally or alternatively, a temporary translation table can be generated so that, should the objects not be found using the existing mapping, the translation table can be consulted and a copy of the desired data objects can be returned from a cached copy and/or from the new location indicated by the new mapping.
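The fallback lookup through a temporary translation table can be sketched as follows. The data shapes are assumptions for illustration: `existing_map` and `translation` are hypothetical dictionaries from object name to device ID, and `stores` is a hypothetical dictionary from device ID to the set of object names currently held on that device.

```python
def locate(obj, existing_map, translation, stores):
    """Find the device currently holding an object during a migration.

    The existing mapping is tried first; if the object is no longer at
    that location, the translation table supplies the new location.
    Returns the device ID, or None if the object cannot be found.
    """
    dev = existing_map.get(obj)
    if dev is not None and obj in stores.get(dev, set()):
        return dev
    new_dev = translation.get(obj)
    if new_dev is not None and obj in stores.get(new_dev, set()):
        return new_dev
    return None
```

Once the new map structures have been deployed the table is no longer needed, since the primary lookup then succeeds directly.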
At step 202, various data objects supplied by users 138 of the system 100 are replicated in storage devices 122 housed in different locations (e.g., zones) in accordance with an existing mapping structure. The existing mapping structure may include the account, container and object rings 140, 142, 144 discussed above having a format such as set forth in
At some point during the operation of the system 100, a new map structure is generated as indicated at step 204. This will be carried out by the map builder module 190 of
The data objects that require migration to conform the system to the new map structure are identified at step 206. This may be carried out by the data migration module 192 of
Although not shown in
Once the data migration is confirmed as being completed, step 210, the new map structures are deployed to the various storage nodes and other locations throughout the system at step 212, after which the process ends at step 214. This can be carried out automatically or manually by system operators. For example, the system can notify the system operators that the data migration (including a substantial portion thereof) has been completed, and the system operators can thereafter direct the deployment of the new map structures to the various locations in the system.
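The overall migrate-then-deploy ordering can be summarized in a short sketch. The model is deliberately simplified and hypothetical: each map is a dictionary from object name to device ID, `placement` records where each object is actually stored, each storage controller is a dictionary with a "map" entry, and the rewrite of an object is reduced to a single assignment.

```python
def update_maps(controllers, placement, existing_map, new_map):
    """Migrate data to match new_map, then deploy new_map (sketch only).

    Returns True if the new map was deployed, False otherwise.
    """
    # Identify the objects whose location differs between the two maps.
    to_move = {obj: dev for obj, dev in new_map.items()
               if existing_map.get(obj) != dev}

    # Migrate while the controllers still hold the existing map.
    for obj, dev in to_move.items():
        placement[obj] = dev  # stand-in for a read and rewrite

    # Confirm that the stored state conforms to the new map.
    if any(placement.get(obj) != dev for obj, dev in new_map.items()):
        return False

    # Only now distribute the new map to every storage controller.
    for controller in controllers:
        controller["map"] = dict(new_map)
    return True
```

The point illustrated is the ordering: no controller sees the new map until the stored data already match it, so no large-scale reactive migration is triggered by the deployment.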
The systems embodied herein are suitable for use in cloud computing environments as well as a variety of other environments. Data storage devices in the form of HDDs, SSDs and SDHDs have been illustrated but are not limiting, as any number of different types of media and operational environments can be adapted to utilize the embodiments disclosed herein.
It is to be understood that even though numerous characteristics and advantages of various embodiments of the present disclosure have been set forth in the foregoing description, together with details of the structure and function of various embodiments thereof, this detailed description is illustrative only, and changes may be made in detail, especially in matters of structure and arrangements of parts within the principles of the present disclosure to the full extent indicated by the broad general meaning of the terms in which the appended claims are expressed.