Various embodiments of the present disclosure are generally directed to an apparatus and method for migrating data within an object storage system using available storage system bandwidth.
In accordance with some embodiments, a server communicates with users of the object storage system over a network. A plurality of data storage devices are grouped into zones, with each zone corresponding to a different physical location within the object storage system. A controller directs transfers of data objects between the server and the data storage devices of a selected zone. A rebalancing module directs migration of sets of data objects between zones in relation to an available bandwidth of the network.
In accordance with other embodiments, an object storage system has a plurality of storage nodes each with a storage controller and an associated group of data storage devices each having associated memory. A server is connected to the storage nodes and configured to direct a transfer of data objects between the storage nodes and at least one user device connected to the distributed object storage system. A rebalancing module is configured to identify an existing system utilization level associated with the transfer of data objects from the server, to determine an overall additional data transfer capability of the distributed object storage system above the existing system utilization level, and to direct a migration of data between the storage nodes during a sample period at a rate nominally equal to the additional data transfer capability.
In accordance with other embodiments, a computer-implemented method includes steps of arranging a plurality of data storage devices into a plurality of zones of an object storage system, each zone corresponding to a different physical location and having an associated controller; using a server to store data objects from users of the object storage system in the respective zones; detecting an available bandwidth of the server; and directing migration of data objects between the zones in relation to the detected available bandwidth.
The present disclosure generally relates to the migration of data in an object storage system, such as in a cloud computing environment.
Cloud computing generally refers to a network-based distributed data processing environment. Network services such as computational resources, software and/or data are made available to remote users via a wide area network, such as but not limited to the Internet. A cloud computing network can be a public “available-by-subscription” service accessible by substantially any user for a fee, or a private “in-house” service operated by or for the use of one or more dedicated users.
A cloud computing network is generally arranged as a distributed object storage system whereby data objects (e.g., files) from users (“account holders” or simply “accounts”) are replicated and stored in geographically distributed storage locations within the system. The network is often accessed through web-based tools such as web browsers, and provides services to a user as if such services were installed locally on the user's local computer.
Object storage systems (sometimes referred to as “distributed object storage systems”) are often configured to be massively scalable so that new storage nodes, servers, software modules, etc. can be added to the system to expand overall capabilities in a manner transparent to the user. A distributed object storage system can continuously carry out significant amounts of background overhead processing to store, replicate, migrate and rebalance the data objects stored within the system in an effort to ensure the data objects are available to the users at all times.
Various embodiments of the present disclosure are generally directed to advancements in the manner in which an object storage system migrates data objects within the system. As explained below, in some disclosed embodiments a server is adapted to communicate with users of the distributed object storage system over a computer network. A plurality of data storage devices are arranged to provide memory used to store and retrieve data objects of the users of the system. The data storage devices are grouped into a plurality of zones, with each zone corresponding to a different physical location within the distributed object storage system.
A storage controller is associated with each zone of data storage devices. Each storage controller is adapted to direct data transfers between the data storage devices of the associated zone and the proxy server.
During a data migration operation in which data objects are migrated to a new location, a rebalancing module detects the then-existing available bandwidth of the system. The available bandwidth generally represents that portion of the overall capacity of the system that is not currently being used to handle user traffic. The rebalancing module directs the migration of a set of data objects within the system in relation to the detected available bandwidth. In this way, the data objects can be quickly and efficiently migrated without substantively affecting user data access operations with the system.
The available bandwidth can be measured or otherwise determined in a variety of ways. In some cases, traffic levels are measured at the proxy server level. In other cases, an aggregation switch is monitored to determine the available bandwidth. Software routines can be implemented to detect, estimate or otherwise report the respective traffic levels.
These and various other features of various embodiments disclosed herein can be understood beginning with a review of
The system 100 is accessed by one or more user devices 102, which may take the form of a network accessible device such as a desktop computer, a terminal, a laptop, a tablet, a smartphone, a game console or other device with network connectivity capabilities. In some cases, each user device 102 accesses the system 100 via a web-based application on the user device that communicates with the system 100 over a network 104. The network 104 may take the form of the Internet or some other computer-based network.
The system 100 includes various elements that are geographically distributed over a large area. These elements include one or more management servers 106 which process communications with the user devices 102 and perform other system functions. A plurality of storage controllers 108 control local groups of storage devices 110 used to store data objects from the user devices 102, and to return the data objects as requested. Each grouping of storage devices 110 and associated controller 108 is characterized as a storage node 112.
While only three storage nodes 112 are illustrated in
Generally, data presented to the system 100 by the users of the system are organized as data objects, each constituting a cohesive associated data set (e.g., a file) having an object identifier (e.g., a “name”). Examples include databases, word processing and other application files, graphics, A/V works, web pages, games, executable programs, etc. Substantially any type of data object can be stored depending on the parametric configuration of the system.
Each data object presented to the system 100 will be subjected to a system replication policy so that multiple copies of the data object are stored in different zones. It is contemplated albeit not required that the system nominally generates and stores three (3) replicas of each data object. This enhances data reliability, but generally increases background overhead processing to maintain the system in an updated state.
An example hardware architecture for portions of the system 100 is represented in
The storage rack 118 is a 42 U server cabinet providing 42 units (U) of storage space, with each unit representing about 1.75 inches (in.) of height. The width and length dimensions of the cabinet can vary, but common values may be on the order of about 24 in.×36 in. Each storage enclosure 120 can have a height that is a multiple of the storage units, such as 2 U (3.5 in.), 3 U (5.25 in.), etc.
In some cases, the functionality of the storage controller 108 can be carried out using the local computer 116. In other cases, the storage controller functionality is carried out by processing capabilities of one or more of the storage enclosures 120, and the computer 116 can be eliminated or used for other purposes such as local administrative personnel access. In one embodiment, each storage node 112 from
An example configuration for a selected storage enclosure 120 is shown in
In the context of an HDD, the storage media may take the form of one or more axially aligned magnetic recording discs which are rotated at high speed by a spindle motor. Data transducers can be arranged to be controllably moved and hydrodynamically supported adjacent recording surfaces of the storage disc(s). While not limiting, in some embodiments the storage devices 122 are 3½ inch form factor HDDs with nominal dimensions of 5.75 in×4 in×1 in.
In the context of an SSD, the storage media may take the form of one or more flash memory arrays made up of non-volatile flash memory cells. Read/write/erase circuitry can be incorporated into the storage media module to effect data recording, read back and erasure operations. Other forms of solid state memory can be used in the storage media including magnetic random access memory (MRAM), resistive random access memory (RRAM), spin torque transfer random access memory (STRAM), phase change memory (PCM), in-place field programmable gate arrays (FPGAs), electrically erasable electrically programmable read only memories (EEPROMs), etc.
In the context of a hybrid (SDHD) device, the storage media may take multiple forms such as one or more rotatable recording discs and one or more modules of solid state non-volatile memory (e.g., flash memory, etc.). Other configurations for the storage devices 122 are readily contemplated, including other forms of processing devices besides devices primarily characterized as data storage devices, such as computational devices, circuit cards, etc. that at least include computer memory to accept data objects or other system data.
The storage enclosures 120 include various additional components such as power supplies 124, a control board 126 with programmable controller (CPU) 128, fans 130, etc. to enable the data storage devices 122 to store and retrieve user data objects.
An example software architecture of the system 100 is represented by
The proxy server 136 is connected to a plurality of rings including an account ring 140, a container ring 142 and an object ring 144. Other forms of rings can be incorporated into the system as desired. Generally, each ring is a data structure that maps different types of entities to locations of physical storage. The account ring 140 provides lists of containers, or groups of data objects owned by a particular user (“account”). The container ring 142 provides lists of data objects in each container, and the object ring 144 provides lists of data objects mapped to their particular storage locations.
Each ring 140, 142, 144 has an associated set of services 150, 152, 154 and storage 160, 162, 164. The services and storage enable the respective rings to maintain mappings using zones, devices, partitions and replicas. As mentioned above, a zone is a physical set of storage isolated to some degree from other zones with regard to disruptive events. A given pair of zones can be physically proximate one another, provided that the zones are configured to have different power circuit inputs, uninterruptible power supplies, or other isolation mechanisms to enhance survivability of one zone if a disruptive event affects the other zone. Conversely, a given pair of zones can be geographically separated so as to be located in different facilities, different cities, different states and/or different countries.
Devices refer to the physical devices in each zone. Partitions represent a complete set of data (e.g., data objects, account databases and container databases) and serve as an intermediate “bucket” that facilitates management of the locations of the data objects within the cluster. Data may be replicated at the partition level so that each partition is stored three times, once in each of three different zones. The rings further determine which devices are used to service a particular data access operation and which devices should be used in failure handoff scenarios.
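By way of illustration only, the following sketch shows one possible way in which a ring structure of this general type might map an object name to a partition and then to one device per zone; the hashing scheme, the partition power value and the helper names used here are illustrative assumptions rather than a required implementation.

    import hashlib

    PART_POWER = 16  # 2**16 partitions; an illustrative value only

    def object_partition(account, container, obj):
        # Hash the full object path and keep the most significant bits as
        # the partition number so objects spread evenly across partitions.
        digest = hashlib.md5(("/%s/%s/%s" % (account, container, obj)).encode()).hexdigest()
        return int(digest, 16) >> (128 - PART_POWER)

    def devices_for_partition(partition, zones, replicas=3):
        # 'zones' is a list of device-id lists, one inner list per zone.
        # One device is chosen per zone so that each replica of the partition
        # resides in a different zone, preserving the failure isolation
        # between zones described above.
        return [zone[partition % len(zone)] for zone in zones[:replicas]]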
In at least some cases, the object services block 154 can include an object server arranged as a relatively straightforward blob server configured to store, retrieve and delete objects stored on local storage devices. The objects are stored as binary files on an associated file system. Metadata may be stored as file extended attributes (xattrs). Each object is stored using a path derived from a hash of the object name and an operational timestamp. The last written data always “wins” in a conflict, which helps to ensure that the latest object version is returned responsive to a user or system request. Deleted objects are treated as a 0 byte file ending with the extension “.ts” for “tombstone.” This helps to ensure that deleted files are replicated correctly and older versions do not inadvertently reappear in a failure scenario.
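A simplified sketch of how an object server of this general type might derive an on-disk path from a hash of the object name and an operational timestamp, and record a deletion as a 0 byte “.ts” tombstone, is provided below; the directory layout and function names are assumptions offered for illustration only.

    import hashlib, os, time

    def object_path(root, name, timestamp, deleted=False):
        # The path is derived from a hash of the object name; when multiple
        # timestamped files exist, the newest ("last written") copy wins.
        digest = hashlib.md5(name.encode()).hexdigest()
        ext = ".ts" if deleted else ".data"   # ".ts" marks a tombstone
        return os.path.join(root, digest[:3], digest, "%.5f%s" % (timestamp, ext))

    def write_object(root, name, payload):
        path = object_path(root, name, time.time())
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(payload)
        return path

    def delete_object(root, name):
        # A deletion is recorded as an empty tombstone file rather than an
        # immediate removal so the delete replicates correctly to other zones.
        path = object_path(root, name, time.time(), deleted=True)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "wb").close()
        return path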
The container services block 152 can include a container server which processes listings of objects in respective containers without regard to the physical locations of such objects. The listings may be stored as SQLite database files or in some other form, and are replicated across a cluster in a manner similar to the way in which objects are replicated. The container server may also track statistics with regard to the total number of objects and the total storage usage for each container.
The account services block 150 may incorporate an account server that functions in a manner similar to the container server, except that the account server maintains listings of containers rather than objects. To access a particular data object, the account ring 140 is consulted to identify the associated container(s) for the account, the container ring 142 is consulted to identify the associated data object(s), and the object ring 144 is consulted to locate the various copies in physical storage. Commands are thereafter issued to the appropriate storage node 112 (
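The lookup chain described above can be summarized in skeletal form as follows; the dictionary-based stand-ins for the rings are hypothetical and serve only to indicate the order in which the rings are consulted.

    def locate_object(account_ring, container_ring, object_ring, account, container, obj):
        # 1. Account ring: identify the containers owned by the account.
        if container not in account_ring[account]:
            raise KeyError("container not associated with this account")
        # 2. Container ring: confirm the object is listed in the container.
        if obj not in container_ring[(account, container)]:
            raise KeyError("object not listed in this container")
        # 3. Object ring: return the physical storage locations of the copies,
        #    e.g. a list of (storage node, device) pairs.
        return object_ring[(account, container, obj)]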
Additional services incorporated by or used in conjunction with the rings 140, 142, 144 can include replication services, updating services, ring building services, auditing services and rebalancing services. The replication services attempt to maintain the system in a consistent state by comparing local data with each remote copy to ensure all are at the latest version. Object replication can use a hash list to quickly compare subsections of each partition, and container and account replication can use a combination of hashes and shared high water marks.
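One possible form of the hash list comparison mentioned above is sketched below, in which a hash is computed over the file listing of each subsection (“suffix” directory) of a partition so that two replicas can be compared without transferring object data; the layout and helper names are assumptions.

    import hashlib, os

    def suffix_hashes(partition_dir):
        # One hash per subsection of the partition, computed from the sorted
        # file names that the subsection contains.
        hashes = {}
        for suffix in sorted(os.listdir(partition_dir)):
            names = sorted(os.listdir(os.path.join(partition_dir, suffix)))
            hashes[suffix] = hashlib.md5("".join(names).encode()).hexdigest()
        return hashes

    def out_of_sync_suffixes(local_dir, remote_hashes):
        # Only the subsections whose hashes differ need to be pushed to the
        # remote replica, keeping replication overhead low.
        local = suffix_hashes(local_dir)
        return [s for s, h in local.items() if remote_hashes.get(s) != h]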
The updating services attempt to correct out of sync issues due to failure conditions or periods of high loading when updates cannot be timely serviced. The ring building services build new rings when appropriate, such as when new data and/or new storage capacity are provided to the system. Auditors crawl the local system checking the integrity of objects, containers and accounts. If an error is detected with a particular entity, the entity is quarantined and other services are called to rectify the situation.
In accordance with various embodiments, rebalancing services are provided by a rebalancing module 170 of the system 100 as represented in
The rebalancing module 170 includes a monitor module 172 and a data migration module 174. The monitor module 172 is operationally responsive to a variety of inputs, including system utilization indications, the deployment of new mapping, the addition of new storage, etc. These and other inputs can signal to the monitor module 172 a need to migrate data from one location to another.
Rebalancing may be required, for example, in a storage node 112 to which a new server cabinet 114 (see
Accordingly, at such time that the monitor module 172 determines that a data migration operation is required, the monitor module 172 identifies an available bandwidth of the system 100. The available bandwidth represents the data transfer capacity of the system that is not currently being utilized to service data transfer operations with the users of the system. In some cases, the available bandwidth, BAVAIL, can be determined as follows:
BAVAIL=(CTOTAL−CUSED)*(1−K)  (1)
Where CTOTAL is the total I/O data transfer capacity of the system, CUSED is that portion of the total I/O data transfer capacity of the system that is currently being used, and K is a derating (margin) factor. The capacity can be measured in terms of bytes/second transferred between the proxy server 136 and each of the users 138 (see
The CUSED value can be obtained by the monitor module 172 by directly or indirectly measuring, or estimating, the instantaneous or average traffic volume per unit time at the proxy server 136. Other locations within the system can be measured in lieu of, or in addition to, the proxy server. Generally, however, it is contemplated that the loading at the proxy server 136 will be indicative of overall system loading in a reasonably balanced system.
The derating factor K can be used to provide margin for both changes in peak loading as well as errors in the determined measurements. A suitable value for K may be on the order of 0.02 to 0.05, although other values can be used as desired. It will be appreciated that other formulations and detection methodologies can be used to assess the available bandwidth in the system.
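For reference, a direct rendition of equation (1) is set forth below with the capacities expressed in bytes per second; the example values and the 0.03 derating factor are illustrative only.

    def available_bandwidth(c_total, c_used, k=0.03):
        # Equation (1): BAVAIL = (CTOTAL - CUSED) * (1 - K)
        #   c_total : total I/O data transfer capacity (bytes/second)
        #   c_used  : portion currently serving user traffic (bytes/second)
        #   k       : derating (margin) factor, e.g. 0.02 to 0.05
        return max(0.0, (c_total - c_used) * (1.0 - k))

    # Example: a system with 10 GB/s of total capacity carrying 7 GB/s of
    # user traffic leaves about 2.91 GB/s available for migration at K = 0.03.
    b_avail = available_bandwidth(10e9, 7e9, k=0.03)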
The available bandwidth BAVAIL may be selected for a particular sample time period TN. The sample time period can have any suitable resolution, such as ranging from a few seconds to a few minutes or more depending on system performance. Sample durations can be adaptively adjusted responsive to changes (or lack thereof) in system utilization levels.
The available bandwidth BAVAIL is provided to the data migration module 174, which selects an appropriate volume of data objects to be migrated during the associated sample time period TN. The volume of data migrated is selected to fit within the available bandwidth for the time period. In this way, the migration of the data will generally not interfere with ongoing data access operations with the users of the system. The process is repeated for each successive sample time period TN+1, TN+2, etc. until all of the pending data have been successfully migrated.
In sum, the proxy server 136 has a total data transfer capacity in terms of a total possible number of units of data transferrable per unit of time. The rebalancing module 170 determines the available bandwidth in relation to a difference between the total data transfer capacity and an existing system utilization level of the proxy server, which comprises an actual number of units of user data transferred per unit of time. It will be appreciated that where and how the available bandwidth is measured or otherwise determined will depend in part upon the particular architecture of the system.
From a comparison of the relative heights of the respective cross-sectional areas 187, 189 in
At step 202, data objects supplied by users 138 are replicated in storage devices 122 housed in different zones. Various map structures including account, container and object rings are generated to track the locations of these replicated sets.
New storage mapping is deployed at step 204, such as due to a failure condition, the addition of new memory, or some other event that results in a perceived need to perform a rebalancing operation to migrate data from one zone to another.
The monitor module 172 of
At step 210, the data migration module 174 of
The data sets are migrated at step 212, which involves other system services of the architecture to arrange, configure and transfer the data to the new storage location(s). Various other steps such as updating ring structures, tombstoning, etc. may be carried out as well.
Decision step 214 determines whether additional data objects should be migrated, and if so, the routine returns to step 206 for a new measurement of the then-existing system utilization level. In some cases, the migration module 174 may request a command complete status from the invoked resources and compare the actual transfer time to the estimated time to determine whether the data migrations in fact took place in the expected time frame over the last time period. Faster than expected transfers may result in more data object volume being migrated during a subsequent time period, and slower than expected transfers may result in smaller data object volume being migrated during a subsequent time period.
The foregoing processing continues until all data migrations have been completed, at which point any remaining system parameters are updated, step 216, and the process ends at step 218.
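The flow described above may be summarized by the following sketch, which assumes hypothetical helpers measure_used() and migrate() as stand-ins for whatever instrumentation and transfer services the system provides; the sample period, total capacity and scaling limits shown are illustrative.

    import time

    SAMPLE_PERIOD = 60.0   # seconds per sample time period TN (illustrative)
    C_TOTAL = 10e9         # total I/O capacity in bytes/second (illustrative)

    def rebalance(pending_bytes, measure_used, migrate, k=0.03):
        # measure_used(): returns the current user traffic level CUSED (bytes/s)
        # migrate(n):     transfers up to n bytes of pending objects and
        #                 returns the number of bytes actually moved
        scale = 1.0
        while pending_bytes > 0:                                        # step 214
            c_used = measure_used()                                     # step 206
            b_avail = max(0.0, (C_TOTAL - c_used) * (1.0 - k))          # step 208
            quota = min(pending_bytes, b_avail * SAMPLE_PERIOD * scale) # step 210
            start = time.time()
            moved = migrate(quota)                                      # step 212
            pending_bytes -= moved
            elapsed = max(time.time() - start, 1e-6)
            # Faster-than-expected transfers raise the next period's quota;
            # slower-than-expected transfers reduce it (bounded either way).
            scale = min(2.0, max(0.5, scale * (SAMPLE_PERIOD / elapsed)))
        # steps 216/218: update any remaining system parameters and end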
In further embodiments, the monitor module 172 of
The volume detector 220 generally operates to detect the volume of data being processed by the proxy server 136 (
The operation of these various features can be observed from graphical representations of adaptive data migration operations as set forth in
Data migration curve segments 234, 236 are located on opposing sides of the peak utilization point 232, and the cross-hatched areas under these respective segments and above line 230 correspond to first and second data migration intervals. A threshold T1 is denoted by broken line 238. This threshold is established and monitored by the threshold circuit 226 of
From
In this way, the rebalancing module 170 (
In
A second threshold T2 is represented by broken line 248, and the data migration operation is resumed (under curve 246) once the system utilization curve falls below this second threshold 248. In some cases, both threshold detection and slope detection mechanisms can be employed to initiate and suspend data migration operations. For example, a relatively low slope may allow data migrations to continue at a relatively higher overall system utilization level, whereas relatively high slopes may signify greater volatility in system utilization and cause the discontinuation (or reduction) of data migrations to account for greater variations. Large volatility in the system utilization rates can cause other adaptive adjustments as well; for example, increases in slope of a system utilization curve (e.g., S1) can cause an increase in the derating factor K (equation (1)) to provide more margin while still allowing data migrations to continue.
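One way in which the threshold and slope mechanisms described above might be combined is sketched below; the specific threshold levels, slope limit and cap on K are illustrative assumptions rather than required values.

    def migration_gate(migrating, utilization, slope, k, t1=0.85, t2=0.70, slope_limit=0.05):
        # Returns (allow_migration, derating_factor) for the next sample period.
        #   utilization : current system utilization level, 0.0 to 1.0
        #   slope       : change in utilization since the prior sample
        # Rapidly rising utilization widens the margin (larger K) so that
        # migrations can continue with additional headroom.
        if slope > slope_limit:
            k = min(0.10, k + slope)
        if migrating:
            # Suspend once utilization crosses the upper threshold T1, or when
            # high volatility is detected near that threshold.
            allow = utilization < t1 and not (slope > slope_limit and utilization >= t2)
            return allow, k
        # Once suspended, resume only after utilization falls back below the
        # lower threshold T2 (hysteresis between T1 and T2).
        return utilization < t2, k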
Other factors such as historical data (e.g., history log 228), time of day/week/month, previous access (e.g., read/write) patterns, etc. can be included in the adaptive data migration scheme. In this way, data migrations can be adaptively scheduled to maximize data transfers without significantly impacting existing user access to the system.
The controller rack 302 includes an aggregation switch 306 and one or more proxy servers 308. Each storage rack 304 includes a so-called top of the rack (TOTR) switch 310, one or more storage servers 312, and one or more groups of storage devices 314. Other elements can be incorporated into the respective racks, and the configuration can be expanded as required. In one embodiment, each controller rack 302 is associated with three (3) adjacent storage racks.
As depicted in
Individual connections are further provided between the aggregation switch 306 and the TOTR switches 310. The TOTR switches provide an access path for the elements in the associated storage rack 304. The storage servers 312 are connected to the TOTR switches 310 in each storage rack 304, and the storage devices 314 (not depicted in
Different types of data transfers involve different elements within the architecture 300. For example, user access requests are received by the aggregation switch 306 and processed by a selected proxy server 308. The proxy server 308 in turn services the request by passing appropriate access commands through the aggregation switch 306 to the appropriate TOTR switch 310, and from there to the appropriate storage server 312 and storage device 314 (
Internal data migration, balancing and other operations may or may not involve the aggregation switch 306. For example, movement of data from one storage server to another within the same storage rack 304 may be routed through the associated TOTR switch 310. On the other hand, movement of data from one storage rack 304 to another requires passage through the aggregation switch 306.
The available bandwidth can be determined as discussed above by monitoring the system at one or more locations. In some cases, monitoring the movement of user data in service of user communications at the aggregation switch 306 can be used to measure or estimate the available bandwidth. In other cases, each of the proxy servers 308 can be monitored to determine the available bandwidth. Software routines can be executed on the local server(s) and/or switches to measure then-existing levels of user traffic.
Referring again to
With reference again to
The systems embodied herein are suitable for use in cloud computing environments as well as a variety of other environments. Data storage devices in the form of HDDs, SSDs and SDHDs have been illustrated but are not limiting, as any number of different types of media and operational environments can be adapted to utilize the embodiments disclosed herein.
As used herein, the term “available bandwidth” and the like will be understood consistent with the foregoing discussion to describe a data transfer capability/capacity of the system (e.g., network) as the difference between an overall data transfer capacity/capability of the system and that portion of the overall data transfer capacity/capability that is currently utilized to transfer data with users/user devices of the system (e.g., the existing system utilization level). The available bandwidth may or may not be reduced by a small derating margin (e.g., the factor K in equation (1)).
It is to be understood that even though numerous characteristics and advantages of various embodiments of the present disclosure have been set forth in the foregoing description, together with details of the structure and function of various embodiments thereof, this detailed description is illustrative only, and changes may be made in detail, especially in matters of structure and arrangements of parts within the principles of the present disclosure to the full extent indicated by the broad general meaning of the terms in which the appended claims are expressed.