Distributed object storage

Information

  • Patent Grant
  • 10545914
  • Patent Number
    10,545,914
  • Date Filed
    Tuesday, January 17, 2017
  • Date Issued
    Tuesday, January 28, 2020
Abstract
The disclosure provides a system, method and computer-readable storage device embodiments. Some embodiments can include an IPv6-centric distributed storage system. An example method includes receiving, at a computing device, a request to create metadata associated with an object from a client, creating the metadata based on the request and transmitting the metadata and an acknowledgment to the client, wherein the metadata contains an address in a storage system for each replica of the object and wherein the metadata can be used to write data to the storage system and read the data from the storage system. There is no file system layer between an application layer and a storage system layer.
Description
TECHNICAL FIELD

The present disclosure relates to the storage of data and, more particularly, to a distributed storage system that utilizes a pool of metadata servers and a pool of storage nodes and that uses unique addresses for content, such as IPv6 (or similar) addresses.


BACKGROUND

Many distributed storage systems exist; the Google File System, Ceph, Hadoop, and Amazon EC2 are a few of the most common. Ceph is an object storage system that optionally provides a traditional file system interface with POSIX semantics. Object storage systems complement but do not replace traditional file systems. One can run one storage cluster for object, block and file-based data storage. Ceph's file system runs on top of the same object storage system that provides object storage and block device interfaces. The Ceph metadata server cluster provides a service that maps the directories and file names of the file system to objects stored within RADOS (Reliable Autonomic Distributed Object Store) clusters. The metadata server cluster can expand or contract, and it can rebalance the data dynamically to distribute data evenly among cluster hosts. This ensures high performance and prevents heavy loads on specific hosts within the cluster.


Storage systems with typical architectures have a number of issues that reduce their efficiency. These issues include the many layers of software through which communication must pass to write and read data. The heavy layering increases the complexity of the system, which can require detailed configuration and optimization efforts. The current architectures are also difficult to scale given the layering and complexity issues. Furthermore, all these architectures are built on the fundamental assumption that the disks are the performance bottleneck. Much software engineering effort has been spent on solutions (e.g., file system caches) that mask poor disk performance. New solid-state device (SSD) technologies are likely to make this foundational assumption obsolete. As a consequence, a whole industry could fall apart and be replaced by new approaches in which the storage devices are no longer considered the performance bottleneck. These and other issues suggest a need in the art for improved processes for managing data storage.





BRIEF DESCRIPTION OF THE DRAWINGS

In order to describe the manner in which the above-recited and other advantages and features of the disclosure can be obtained, a more particular description of the principles briefly described above will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. Understanding that these drawings depict only exemplary embodiments of the disclosure and are not therefore to be considered to be limiting of its scope, the principles herein are described and explained with additional specificity and detail through the use of the accompanying drawings in which:



FIG. 1 illustrates an example system configuration;



FIG. 2 illustrates a layering of software for managing object storage;



FIG. 3 illustrates an approach of the communication of metadata servers and storage nodes;



FIG. 4 illustrates a layered structure, RAID storage devices and a server;



FIG. 5 illustrates signal processing for a create request;



FIG. 6 illustrates further the process of placing replicas in connection with the create request;



FIG. 7 illustrates the signal processing for a write request;



FIG. 8 illustrates further processing in connection with the write request;



FIG. 9 illustrates processing in a read request;



FIG. 10 illustrates further processing in connection with a read request;



FIG. 11 illustrates the logical view of the responsibility of every actor in the storage system;



FIG. 12 illustrates an object storage architecture;



FIG. 13 illustrates an aspect of the storage architecture according to this disclosure; and



FIG. 14 illustrates a method example.





DESCRIPTION OF EXAMPLE EMBODIMENTS

Various embodiments of the disclosure are discussed in detail below. While specific implementations are discussed, it should be understood that this is done for illustration purposes only. A person skilled in the relevant art will recognize that other components and configurations may be used without departing from the spirit and scope of the disclosure.


Overview

Additional features and advantages of the disclosure will be set forth in the description which follows, and in part will be obvious from the description, or can be learned by practice of the herein disclosed principles. The features and advantages of the disclosure can be realized and obtained by means of the instruments and combinations particularly pointed out in the appended claims. These and other features of the disclosure will become more fully apparent from the following description and appended claims, or can be learned by the practice of the principles set forth herein.


The present disclosure addresses the issues raised above with respect to storing and managing data in distributed storage systems. The disclosure aims at solving these problems by collapsing layers and leveraging new (IPv6, for example) functionalities such as segment routing. The proposed solution also addresses the potential CPU/network bottleneck that results from forthcoming very high performance storage devices.


The disclosure provides a system, method and computer-readable storage device embodiments. An example method includes receiving, at a computing device, a request to create metadata associated with an object from a client, creating the metadata based on the request and transmitting the metadata and an acknowledgment to the client, wherein the metadata contains an address in a storage system for each replica of the object and wherein the metadata can be used to write data to the storage system and read the data from the storage system.


An aspect of this disclosure is that there is no file system layer between the application layer and the storage system. In another aspect, the file system can become the application as described herein. The storage system contains the pool of metadata servers and the pool of storage servers. Writing and reading the data from the storage system can be accomplished via an IPv6 address stored in or associated with the metadata. The IPv6 address can identify and/or locate the data. In one example, the IPv6 prefix can be used to represent a group of addresses and/or subnets. Moreover, the IPv6 prefix can represent specific nodes and/or classes of data, objects, storage, etc. Classes can be based on one or more factors such as quality of service (QoS) requirements, priorities, policies, size, partitioning, a similarity, a state, a property, usage characteristics, a preference, a parameter, a data type, storage characteristics, etc. For example, an IPv6 prefix can represent, without limitation, a specific node, a specific type of storage, or a specific type of data.


A metadata prefix can represent a metadata server, a storage node, metadata classes, etc. Metadata replicas have distinct IPv6 addresses and in one aspect would not be identified by a prefix. In some cases, a metadata prefix is assigned to each tenant in a multi-tenant environment. This can enable isolation, improve security, facilitate management, prevent collisions, etc. The client can compute a family of pseudorandom seeded X-bit hashes based on at least one of an object name and consecutive integers as seeds. X can be less than or equal to 128, and its value can depend on the length of the metadata IPv6 prefix assigned to a storage domain. The organization of metadata servers is unknown to the client and can be dynamic if metadata servers are added or removed. Only the global metadata IPv6 prefix is static. The metadata can include, without limitation, the address for metadata replicas, the address of object replicas, state information associated with the object replicas, object name, object characteristics (e.g., size, properties, etc.), storage node or system information (e.g., access control lists, policies, configuration data, etc.), and so forth. Thus, for example, the metadata, when used to write the data to the storage system, can be utilized to write replica data to the storage system.


The method can further include, by the computing device, determining where to store the data on the storage system based on one or more of a placement policy, system-wide metrics, client recommendations, and quality of service requirements. As previously noted, an IPv6 address can be used to identify the data and/or location of the data, and the prefix associated with the IPv6 address can identify or represent the storage node, storage segment, data class, etc.


Description

The present disclosure addresses the issues raised above. For example, the architecture disclosed herein is flexible, scalable, and not as heavily layered as prior approaches. Accordingly, the complexity imposed by so many layers can be reduced to a simpler system. This can reduce the number of disk I/Os, bottlenecks, use of mass storage, and multiple layers. The approach disclosed herein also improves the ability to scale the storage system. The present disclosure addresses these problems and other problems by collapsing layers and leveraging IPv6 functionality such as, but not limited to, segment routing.


The disclosure first discusses in more detail some of the issues with standard storage systems. Storage systems generally fall under a particular type of architecture 200 shown in FIG. 2. The application layer 202 runs on top of the storage system. The file system layer 204, while often part of the storage system, is not mandatory if the applications are designed to work directly with an object storage 206. There also can be distributed file systems that run directly on a block storage 208, such as the Google File System (GFS). The object storage and/or block storage 210 are considered the heart of the overall storage system. The structures vary a lot between the different distributed storage systems. FIG. 3 illustrates the general architecture 300 for these various layers.


The system 300 stores metadata about the stored objects, files, and/or the whole system on metadata servers 302. Depending on the system, there can be multiple metadata servers (as in the Hadoop Distributed File System, or HDFS) or just one (GFS). In these systems, a protocol 304 is designed for the client applications to communicate with the metadata servers 302. The protocol 304 is often based on HTTP. The data is ultimately stored in storage nodes 306: a pool of storage nodes contains the actual data. These nodes 306 are often organized in a structure such as a ring, a tree, or any other structure. The protocol 304 is then used by the clients and the metadata servers 302 to interact with the storage nodes to write or retrieve data, replicate contents at the file system level, load balance, or perform any other features of the system.



FIG. 4 illustrates a storage node structure 400. The storage node 402 includes an application layer 404 that is in charge of receiving requests and handling them based on the protocol mentioned above. The application layer 404 usually sits on top of the local file system 406 having partitions that contain stored contents. For local replication, a redundant array of independent disks (RAID) controller 408 can be used to ensure the data is not lost on the storage node. Note that this level of replication is independent of the system-level replication. This can lead to redundancies and cost inefficiencies, because it can produce a high effective replication factor, which in turn means a much lower ratio of effective data stored to total system storage capacity.
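The effect of stacking node-level replication under system-level replication can be illustrated with a short calculation. This is a minimal sketch under stated assumptions: the function name `usable_fraction` and the overhead figures (three system-level replicas, a mirroring RAID that stores every byte twice) are illustrative values, not figures taken from the disclosure.

```python
def usable_fraction(system_replicas: int, local_overhead: float) -> float:
    """Return the fraction of raw capacity that holds unique data.

    system_replicas: number of copies kept by the distributed layer.
    local_overhead: raw bytes consumed per stored byte by the node-local
        RAID layer (2.0 for simple mirroring; an assumed value).
    """
    if system_replicas < 1 or local_overhead < 1.0:
        raise ValueError("replication factors cannot be below 1")
    effective_replication = system_replicas * local_overhead
    return 1.0 / effective_replication


# Three system-level replicas on mirrored disks: every byte is stored
# six times, so only about one sixth of the raw capacity is unique data.
print(usable_fraction(3, 2.0))
```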


As noted above, there are a number of issues with standard architectures. For example, the software is heavily layered. A client fetching data must communicate through a large number of software layers, which can be as many as seven or eight layers. These layers are not always designed to interoperate in an optimal way. This translates into read and write throughput that is often not optimal. Next, the heavy layering also imposes a fair amount of complexity on the operators. Each of these layers requires complex configuration, optimization, parameterization, and so forth. Furthermore, in these kinds of systems, multiple layers are added on top of one another based on the assumption that disk inputs and outputs (I/Os) are the effective bottleneck of most storage systems. This means that different software layers are partly designed to reduce the number of I/Os at the cost of more RAM and/or CPU usage. With upcoming large improvements in disk and flash technologies, this is not going to be the case anymore. Storage I/Os are bound to be a lot faster in a few years, shifting the bottleneck from storage I/Os to network bandwidth and even the CPU. Thus, additional software layers that consume CPU cycles are going to become a hindrance more than a help in storage systems.


Most of these systems have limited scaling capacity. The GFS has built itself around a single-master approach. This means that every client interaction with the system has to go at least once through a single master (replicated for failover but not for load balancing) that contains the useful metadata. Even with lightweight metadata, limited interactions and client caching, the approach scales only to a point as the number of clients grows. Ceph has chosen not to have metadata servers (this is not completely true: metadata servers will need to keep track of the cluster map, but this is not usually the main bottleneck of Ceph-based systems). Instead, it places data deterministically by hashing the object name and finding a storage node target according to the hash. While this effectively removes the master bottleneck of GFS, it implies that when a storage node is added or removed (voluntarily or upon failure), a non-negligible quantity of data has to be moved to the new target node of its deterministic hash. Analytically, the order of magnitude of data that has to be moved is around the capacity of the device added or removed. While this works for small clusters where devices are not often added or removed, it does not easily scale for bigger clusters having numerous big storage nodes.


Furthermore, the capacity of storage devices increases much faster than bandwidth capacity. That is to say, in a few years from now, the network capacity won't be able to sustain adding or removing a petabyte storage node.


The protocols used for intra and inter layers (between metadata servers, storage nodes and clients) create additional overhead for every communication. This overhead is naturally augmented by the complexity of the system. This is because the more layers there are, the more difficult it is to optimize their interactions and deal with exceptional or rare cases without decreasing the overall efficiency.


The disclosure next turns to FIG. 1 which generally describes a computer system, such as a computer client or server.



FIG. 1 illustrates a computing system architecture 100 wherein the components of the system are in electrical communication with each other using a bus 105. Exemplary system 100 includes a processing unit (CPU or processor) 110 and a system bus 105 that couples various system components including the system memory 115, such as read only memory (ROM) 120 and random access memory (RAM) 125, to the processor 110. The system 100 can include a cache 112 of high-speed memory connected directly with, in close proximity to, or integrated as part of the processor 110. The system 100 can copy data from the memory 115 and/or the storage device 130 to the cache 112 for quick access by the processor 110. In this way, the cache can provide a performance boost that avoids processor 110 delays while waiting for data. These and other modules can control or be configured to control the processor 110 to perform various actions. Other system memory 115 may be available for use as well. The memory 115 can include multiple different types of memory with different performance characteristics. The processor 110 can include any general purpose processor and a hardware module or software module, such as module 1 132, module 2 134, and module 3 136 stored in storage device 130, configured to control the processor 110 as well as a special-purpose processor where software instructions are incorporated into the actual processor design. The processor 110 may essentially be a completely self-contained computing system, containing multiple cores or processors, a bus, memory controller, cache, etc. A multi-core processor may be symmetric or asymmetric.


To enable user interaction with the computing device 100, an input device 145 can represent any number of input mechanisms, such as a microphone for speech, a touch-sensitive screen for gesture or graphical input, keyboard, mouse, motion input and so forth. An output device 135 can also be one or more of a number of output mechanisms known to those of skill in the art. In some instances, multimodal systems can enable a user to provide multiple types of input to communicate with the computing device 100. The communications interface 140 can generally govern and manage the user input and system output. There is no restriction on operating on any particular hardware arrangement and therefore the basic features here may easily be substituted for improved hardware or firmware arrangements as they are developed.


Storage device 130 is a non-volatile memory and can be a hard disk or other types of computer readable media which can store data that are accessible by a computer, such as magnetic cassettes, flash memory cards, solid state memory devices, digital versatile disks, cartridges, random access memories (RAMs) 125, read only memory (ROM) 120, and hybrids thereof.


The storage device 130 can include software modules 132, 134, 136 for controlling the processor 110. Other hardware or software modules are contemplated. The storage device 130 can be connected to the system bus 105. In one aspect, a hardware module that performs a particular function can include the software component stored in a computer-readable medium in connection with the necessary hardware components, such as the processor 110, bus 105, display 135, and so forth, to carry out the function.


This disclosure now turns to a more detailed description of the various concepts and examples herein. Among other things, this disclosure proposes a distributed and flexible storage system with a minimal number of layers. Thus, an example architecture can include the removal of the file system 204 shown in FIG. 2 such that the proposed architecture could simply include the application layer 202, an object storage layer 206 and a block storage layer 208. The application layer 202 can represent any application running on top of the storage system 206/208 (210). The system 210 contains a pool of metadata servers and a pool of storage servers.


There are several concepts that apply to the present disclosure. A first concept is that any kind of entity (for example, an object, metadata, a video, a file, and so forth) in the system can be identified and represented by a set of IP addresses, such as IPv6 addresses. For example, the IPv6 protocol can provide prefixes which can be used to represent a “group” of such IPv6 addresses. The set of IP addresses can take into account the IPv6 addresses of the metadata replicas and of the object replicas. In one aspect it could be said that the primary metadata replica IPv6 address (the first obtained via hashing) suffices to identify the object.


As the structure contemplated for metadata disclosed herein is an IPv6 address in one aspect, this disclosure shall briefly discuss the structure of an IPv6 address. While IPv6 is not required, and other structures are contemplated, IPv6 is discussed as one embodiment. IPv6 addresses have 128 bits, although for this disclosure, the addresses may have less than 128 significant bits. The design of the IPv6 address space implements a different design philosophy than in IPv4, in which subnetting was used to improve the efficiency of utilization of the small address space. In IPv6, the address space is deemed large enough for the foreseeable future, and a local area subnet most of the time uses 64 bits for the host portion of the address, designated as the interface identifier, while the most-significant remaining bits are used as the routing prefix.


The identifier is only unique within the subnet to which a host is connected. IPv6 has a mechanism for duplicate address detection, so that address auto-configuration always produces unique assignments. The 128 bits of an IPv6 address are represented in 8 groups of 16 bits each. Each group is written as four hexadecimal digits and the groups are separated by colons (:). An example of this representation is 2001:0db8:0000:0000:0000:ff00:0042:8329.


An IPv6 packet has two parts: a header and a payload. The header consists of a fixed portion with minimal functionality required for all packets and may be followed by optional extensions to implement special features. The fixed header occupies the first 40 bytes (320 bits) of the IPv6 packet. It contains the source and destination addresses, traffic classification options, a hop counter, and the type of the optional extension or payload which follows the header. This Next Header field tells the receiver how to interpret the data that follows the header. If the packet contains options, this field contains the option type of the next option. The “Next Header” field of the last option points to the upper-layer protocol that is carried in the packet's payload.
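The layout of the 40-byte fixed header can be made concrete with a small parser. The following is an illustrative sketch only: the function name `parse_ipv6_fixed_header` and the sample addresses are assumptions introduced here, and the field layout follows the standard IPv6 header format rather than anything specific to this disclosure.

```python
import ipaddress
import struct


def parse_ipv6_fixed_header(packet: bytes) -> dict:
    """Decode the 40-byte fixed header at the start of an IPv6 packet."""
    if len(packet) < 40:
        raise ValueError("an IPv6 packet starts with a 40-byte fixed header")
    # First 8 bytes: a 32-bit word (version, traffic class, flow label),
    # a 16-bit payload length, the Next Header byte and the hop limit.
    first_word, payload_length, next_header, hop_limit = struct.unpack(
        "!IHBB", packet[:8]
    )
    return {
        "version": first_word >> 28,
        "traffic_class": (first_word >> 20) & 0xFF,
        "flow_label": first_word & 0xFFFFF,
        "payload_length": payload_length,
        "next_header": next_header,
        "hop_limit": hop_limit,
        "source": ipaddress.IPv6Address(packet[8:24]),
        "destination": ipaddress.IPv6Address(packet[24:40]),
    }


# Sample header: version 6, empty payload, Next Header 59 (no next header).
sample = (
    struct.pack("!IHBB", 6 << 28, 0, 59, 64)
    + ipaddress.IPv6Address("2001:db8::1").packed
    + ipaddress.IPv6Address("2001:db8::2").packed
)
header = parse_ipv6_fixed_header(sample)
print(header["version"], header["next_header"], header["destination"])
```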


Extension headers carry options that are used for special treatment of a packet in the network, e.g., for routing, fragmentation, and for security using the IPsec framework. Without special options, a payload must be less than 64 KB. With a Jumbo Payload option (in a Hop-By-Hop Options extension header), the payload must be less than 4 GB.


Unlike with IPv4, routers never fragment a packet. Hosts are expected to use path maximum transmission unit discovery (PMTUD) to make their packets small enough to reach the destination without needing to be fragmented. PMTUD is a standardized technique for determining the maximum transmission unit size on the network path between two IPv6 hosts.


An IPv6 address can be abbreviated to shorter notations by application of the following rules. One or more leading zeroes from any groups of hexadecimal digits are removed; this is usually done to either all or none of the leading zeroes. For example, the group 0042 is converted to 42. Another rule is that consecutive sections of zeroes are replaced with a double colon (::). The double colon may only be used once in an address, as multiple use would render the address indeterminate. Some recommend that a double colon must not be used to denote an omitted single section of zeroes.


An example of application of these rules is as follows: Initial address: 2001:0db8:0000:0000:0000:ff00:0042:8329. After removing all leading zeroes in each group: 2001:db8:0:0:0:ff00:42:8329. After omitting consecutive sections of zeroes: 2001:db8::ff00:42:8329. The loopback address, 0000:0000:0000:0000:0000:0000:0000:0001, can be abbreviated to ::1 by using both rules.
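The abbreviation rules above can be checked with Python's standard `ipaddress` module, which produces both the shortened and the fully expanded notation. This is only an illustration of the notation rules; the variable names are arbitrary.

```python
import ipaddress

addr = ipaddress.IPv6Address("2001:0db8:0000:0000:0000:ff00:0042:8329")
print(addr.compressed)  # 2001:db8::ff00:42:8329
print(addr.exploded)    # 2001:0db8:0000:0000:0000:ff00:0042:8329

loopback = ipaddress.IPv6Address("0000:0000:0000:0000:0000:0000:0000:0001")
print(loopback.compressed)  # ::1

# A second double colon makes the address indeterminate, so it is rejected.
try:
    ipaddress.IPv6Address("2001::ff00::8329")
    double_colon_twice_accepted = True
except ValueError:
    double_colon_twice_accepted = False
```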


Hosts verify the uniqueness of addresses assigned by sending a neighbor solicitation message asking for the Link Layer address of the IP address. If any other host is using that address, it responds. However, MAC addresses are designed to be unique on each network card which minimizes chances of duplication.


The pool of metadata servers is organized in an architecture, which could be a binary tree but is not limited to such a binary tree. The metadata servers can contain metadata for the objects. They are in one example addressed by a range of IP addresses, each metadata server being assigned an IPv6 prefix (an example of which is defined above), i.e., not a single address. The aggregation of the prefixes of the metadata servers belonging to the same storage domain can be fixed and can be the metadata system IPv6 prefix. For example, if the metadata system prefix is 2001::0/64 and there are 2 metadata servers, they will respectively hold the prefixes 2001::0/65 and 2001::8000:0:0:0/65. If there are 4 metadata servers, they will respectively hold the prefixes 2001::0/66, 2001::4000:0:0:0/66, 2001::8000:0:0:0/66 and 2001::c000:0:0:0/66, hence the possible idea of a binary tree.
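The division of a metadata system prefix among servers can be reproduced with the standard `ipaddress` module. The sketch below assumes an even, binary-tree-style split across a power-of-two number of servers; the function name `split_metadata_prefix` is hypothetical, and the disclosure does not limit the organization to this scheme.

```python
import ipaddress


def split_metadata_prefix(system_prefix: str, server_count: int) -> list:
    """Split a metadata system prefix evenly among metadata servers.

    Assumes server_count is a power of two, which mirrors the binary-tree
    organization given as an example; other organizations are possible.
    """
    if server_count < 1 or server_count & (server_count - 1):
        raise ValueError("server_count must be a power of two")
    network = ipaddress.IPv6Network(system_prefix)
    extra_bits = server_count.bit_length() - 1
    if extra_bits == 0:
        return [network]
    return list(network.subnets(prefixlen_diff=extra_bits))


for prefix in split_metadata_prefix("2001::/64", 4):
    print(prefix)
```

Each extra bit of prefix length doubles the number of servers, so adding a level to the tree splits every existing prefix in two.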


With the structure described herein, the client only needs to know the metadata system prefix. The client does not need to know the storage nodes' IPv6 prefixes; only the metadata servers need to know them in advance. The pool of storage nodes can be organized in the same type of architecture. The storage nodes will contain the objects themselves. They are addressed by a range of IP addresses, each storage node being assigned an IPv6 prefix. A storage node is a logical storage device on top of which an application runs that is able to handle requests, assign a unique identifier (such as an IPv6 address or other protocol address) to a stored object, and retrieve the object according to the unique identifier. The system presented herein is fundamentally not a block storage system as that term is traditionally used for existing systems. It is a native object storage system that behaves as a block storage when the objects all have the same size. There is an advantage to this approach, since the present disclosure can support several block storage systems having different block sizes as well as different object storage systems, all of them at the same time and possibly sharing the same physical storage backend infrastructure, i.e., the storage nodes.


For example, a /110 prefix leaves 18 bits for object addresses, so with fixed-size objects of 1 MB (for larger typical objects) it could address up to approximately 256 GB of data; addressing approximately 4 TB with 1 MB objects would require a /106 prefix. In another example, a /96 prefix with fixed-size objects of size 8 KB (a typical block size for a file system) could address up to approximately 32 TB of data. Thus, through the usage of different prefixes on the same storage system backend, different block storage systems with different block sizes could be built.
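The relationship between prefix length, object size and addressable capacity is simple arithmetic: a prefix of length p leaves 128 - p bits, hence 2^(128 - p) addresses, each naming one fixed-size object. A minimal sketch follows; the function name `addressable_bytes` is illustrative and binary units are assumed.

```python
def addressable_bytes(prefix_len: int, object_size: int) -> int:
    """Bytes addressable under an IPv6 prefix with fixed-size objects."""
    if not 0 <= prefix_len <= 128:
        raise ValueError("an IPv6 prefix length is between 0 and 128")
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    address_count = 1 << (128 - prefix_len)
    return address_count * object_size


KIB = 1 << 10
TIB = 1 << 40

# A /96 prefix leaves 32 bits: 2**32 objects of 8 KiB each.
print(addressable_bytes(96, 8 * KIB) / TIB)  # 32.0
```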


In one aspect, the identifier can be an IPv6 address which can identify the data, and its prefix can be assigned to the storage node itself. For backward compatibility with existing systems, a library, or any equivalent, will be provided to expose a classical object-level interface, thus keeping the underlying technical details hidden from the application.


Next, example operations of the distributed storage system are described. First, the process of creating an object is described with reference to FIG. 5. FIG. 5 illustrates a system 500 used to create an object. To create the object, the client library 502 computes a family of pseudorandom seeded X-bit hashes based on the object name and consecutive integers as seeds. In one example, X is less than 128 and depends on the length of the metadata system IPv6 prefix 204. The size of the family is the metadata replication factor and is configurable in the library or from another entity. A high replication factor increases the safety of metadata at the cost of a higher storage overhead for metadata and a higher latency when all metadata replicas need to be updated. A high replication factor also improves load balancing for metadata access, which is useful for objects accessed concurrently by a high number of clients.
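One way a client library could derive the metadata replica addresses is sketched below. The disclosure does not name a hash function, so the use of BLAKE2b with the seed passed as its salt is purely an assumption for illustration, as are the function name `metadata_addresses` and the sample object name. The X hash bits fill the part of the address not covered by the metadata system prefix.

```python
import hashlib
import ipaddress


def metadata_addresses(object_name: str, system_prefix: str,
                       replication_factor: int) -> list:
    """Derive one metadata address per seed 0, 1, ..., replication_factor - 1.

    The hash function (BLAKE2b, seed used as salt) is an assumed choice;
    any pseudorandom seeded hash would fit the description.
    """
    network = ipaddress.IPv6Network(system_prefix)
    x_bits = 128 - network.prefixlen  # X depends on the prefix length
    base = int(network.network_address)
    addresses = []
    for seed in range(replication_factor):
        digest = hashlib.blake2b(
            object_name.encode("utf-8"),
            digest_size=16,
            salt=seed.to_bytes(16, "big"),
        ).digest()
        # Keep the top X bits of the 128-bit digest as the address suffix.
        suffix = int.from_bytes(digest, "big") >> (128 - x_bits)
        addresses.append(ipaddress.IPv6Address(base | suffix))
    return addresses


replicas = metadata_addresses("videos/example-object", "2001::/64", 3)
for address in replicas:
    print(address)
```

Because the computation is deterministic, any client that knows the object name and the metadata system prefix obtains the same family of addresses without contacting a directory service.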


A first metadata server 504 is in charge of creating the metadata. What the metadata contains can be variable and customized. It is possible that the client 502 creating an object must provide some information to the first metadata server 504 for it to construct the metadata. Several pieces of information can be contained within the metadata. The metadata can include the addresses of all metadata replicas; in the general case, these should just be consecutive hashes of the object name. However, a metadata server 504 could refuse to hold some metadata for several reasons: for example, the server could be down or could be full. In this case, the next pseudorandom seeded hash would be used as a destination.


The metadata servers 504 holding the replicas should actually notify the former replica holders to complete their metadata. The metadata can also include the address of all object replicas and possibly their state, such as whether they are stale or up-to-date. There are two phases for this process. At the object creation, the first metadata replica holder places all object replicas, according to a given policy. The policy can be determined partly by the storage system itself and partly by the client, and this is a customizable process. When the client first receives metadata, the addresses for the object replicas are generic addresses for the storage node that they should be stored on. For example, the first or last address of the prefix of the storage nodes can be assigned. Later, storage nodes assign a unique IPv6 address to every replica and metadata is updated. The metadata can also include the object name, so that hash collisions can be dealt with. These are improbable but still could happen.


The metadata server 504 also represents replica metadata servers. Multiple metadata servers can be utilized for redundancy and load-balancing. A client may submit a metadata request based on their computation of the metadata hash. If that server is down or overloaded, the client can compute another metadata hash and access replica metadata at a different server.
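The fallback from one metadata replica to the next can be sketched as a simple loop over the candidate addresses. Everything here is hypothetical: the `request` callable stands in for whatever transport the client library uses, the addresses are made up, and signalling an unavailable server with `ConnectionError` is an assumption.

```python
def fetch_metadata(candidate_addresses, request):
    """Return metadata from the first reachable replica.

    candidate_addresses: metadata addresses derived from the hash family.
    request: hypothetical callable that takes an address and returns the
        metadata, or raises ConnectionError if the server is down or busy.
    """
    last_error = None
    for address in candidate_addresses:
        try:
            return request(address)
        except ConnectionError as error:
            last_error = error  # try the address from the next seed
    raise LookupError("no metadata replica reachable") from last_error


def fake_request(address):
    # Stand-in transport: the first replica is down, the second answers.
    if address == "2001::aaaa":
        raise ConnectionError("server down")
    return {"served_by": address}


result = fetch_metadata(["2001::aaaa", "2001::bbbb"], fake_request)
print(result)
```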


Any other information relevant to the metadata level for the storage system, such as access control lists (ACLs), total duration of the object in a video chunk, and so forth, can also be contained within the metadata. What the metadata contains is highly customizable. However, it should be remembered that at any time, one of the metadata fields can change, and all metadata replicas have to be updated. For example, for a regular file, every write on the file increments the file size. As such, in one aspect, the system does not store the size of the file in the metadata but rather at the beginning of the object containing the text file itself. Utilizing the various information that the metadata should contain can mean that the metadata creation process is dependent upon the desired policy, both for resiliency and consistency.
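The kinds of fields discussed above could be grouped into a record such as the following. This is a sketch of one possible layout, not the format used by the disclosed system: the class and field names are invented for illustration, and the addresses in the usage example are arbitrary.

```python
from dataclasses import dataclass, field
from enum import Enum
from ipaddress import IPv6Address


class ReplicaState(Enum):
    UP_TO_DATE = "up-to-date"
    STALE = "stale"


@dataclass
class ObjectReplica:
    address: IPv6Address  # generic node address first, unique address later
    state: ReplicaState = ReplicaState.UP_TO_DATE


@dataclass
class ObjectMetadata:
    object_name: str          # kept so that hash collisions can be detected
    metadata_replicas: list   # addresses of all metadata replicas
    object_replicas: list     # ObjectReplica entries
    extra: dict = field(default_factory=dict)  # e.g. ACLs (customizable)

    def matches(self, requested_name: str) -> bool:
        """Guard against two names hashing to the same metadata address."""
        return self.object_name == requested_name


record = ObjectMetadata(
    object_name="videos/example-object",
    metadata_replicas=[IPv6Address("2001::1234")],
    object_replicas=[ObjectReplica(IPv6Address("2001:db8::1"))],
    extra={"acl": ["alice"]},
)
print(record.matches("videos/example-object"))
```

Note that the file size is deliberately absent from this record, consistent with the aspect in which frequently changing values are stored with the object rather than in every metadata replica.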


As shown in FIG. 5, one example approach is that the client 502 sends a create request to a metadata server 504. The metadata server creates the metadata and replication data and returns the metadata and an acknowledgment to the client 502. In one aspect, the request may include some user-designated requirements such as a quality of service, a type of storage hardware, a geographic location, accessibility parameters, and so forth. Thus, if the request is an initial request to write an object to a storage node, particular requirements for that process can be articulated. Matching a request for certain parameters with the actual process of writing or reading data can be accomplished in a number of different ways. One example approach could be performed the first time the system receives the metadata prefix. When a user or an administrator requests a certain parameter, the metadata server 504 can essentially match that request with storage nodes implementing policies or qualities of service that match the required parameter. Thus, a certain storage node or group of storage nodes may implement policies (certain QoS, hardware type, etc.) that match the requirement in the request, and the metadata created by the metadata server will direct the client to those storage nodes.


In an alternate approach, a hybrid pool of storage servers can be established in which the client asks for a QoS (or some other parameter) in the create request. In this scenario, the metadata server holding the first metadata replica could provision or establish the storage of the object by selecting the storage nodes that fulfill the requested QoS and implementing the required parameters for the object and its replicas, if any.



FIG. 6 illustrates another aspect of this feature with the graphic 600. The client 502 sends the create metadata request. Three different pathways are described which identify a destination through the use of a hash such as dest: hash(name,0), dest: hash(name, 1) and dest: hash(name, 2). The metadata server 504 places replicas of the data in various storage locations. The metadata from the various metadata servers is returned to the client 502.
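The destinations dest: hash(name, 0), hash(name, 1) and hash(name, 2) can be illustrated with a seeded hash family that maps an object name to addresses inside a metadata prefix. The sketch below makes several assumptions that are not part of the disclosure: SHA-256 stands in for the pseudorandom function, the seed is simply prepended to the name, and the /80 prefix is taken from the IPv6 documentation range.

```python
import hashlib
import ipaddress

# Hypothetical metadata system prefix (2001:db8::/32 is reserved for documentation).
METADATA_PREFIX = ipaddress.IPv6Network("2001:db8:0:0:1::/80")


def metadata_address(name, seed, prefix=METADATA_PREFIX):
    """Map (object name, seed) to an IPv6 address inside the metadata prefix."""
    host_bits = 128 - prefix.prefixlen  # X, the number of hashed bits (48 here)
    digest = hashlib.sha256(f"{seed}:{name}".encode("utf-8")).digest()
    # Keep only the low X bits of the digest as the host part of the address.
    suffix = int.from_bytes(digest, "big") & ((1 << host_bits) - 1)
    return ipaddress.IPv6Address(int(prefix.network_address) + suffix)


def metadata_replica_addresses(name, replicas=3):
    """One address per metadata replica, using consecutive integers as seeds."""
    return [metadata_address(name, seed) for seed in range(replicas)]


addresses = metadata_replica_addresses("video/chunk-0001")
```

Because the function is deterministic, any client that knows the prefix and the hashing function computes the same addresses without consulting a map.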



FIG. 7 illustrates a write operation 700. To write an object, the client library first fetches the object metadata through a metadata request to the metadata server 504. To retrieve the metadata, the client computes the X-bit hashes previously mentioned, which give the IPv6 address of the metadata. For metadata load balancing, the client 502 can compute any of the X-bit hashes of the previous hash family. With the metadata, the client 502 sends a write request with the data to the dedicated storage nodes 506. This can be done in parallel so that the client 502 can ensure that the data has been written on all storage nodes. It can also be done in sequence so that the client 502 only waits for a number of storage nodes to acknowledge, thus reducing latency at the cost of a slightly higher probability of failure.


An example policy is a classical quorum policy. For an X:Y quorum policy (typical values are 2:3 or 3:5 depending on the resiliency policy), the client 502 writes data on the primary replica. This primary replica then updates the other replicas. The client receives an acknowledgment only when X out of the Y replicas have been written, so that, for a 2:3 policy, the primary replica must receive an acknowledgment from at least one of the other replicas. This is also influenced by the expected consistency policy: the smaller the ratio of the quorum policy, the faster the client gets an acknowledgment and the lower the latency, but the higher the chance of having inconsistent replicas. This can be a problem if the application reads a replica of the just-modified object that has not yet been updated after having received an acknowledgment for the update. In one example, if the administrator wants high reliability, the system can establish five replicas of an object. The purpose of such a high number of replicas can be security, load-balancing, and so forth. When the system is storing the five replicas, one may not want to wait until all five replicas are completely stored before sending an acknowledgment. A policy could be established in which the storage nodes are to send an acknowledgment after storing three of the five copies. Then, to ensure that the five replicas are successfully stored, the policy could provide that, if there are errors in storing the fourth or fifth copies, repairs can be made from one of the successfully stored first three copies. This provides one nonlimiting example of the kind of flexibility that can be available in the storage system utilizing the IPv6-based approach disclosed herein. Because of this flexibility, users can more easily manage the storage of their data. For example, large video files can be stored for further processing or chunked into multiple smaller pieces and, because of the manner in which the storage of data is managed as disclosed herein, any desired approach can be easily managed using the IPv6-based storage management system. A policy could be established to store one main copy of an object on an SSD and backup copies on a hard drive.
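The X:Y quorum policy can be illustrated with a small in-memory simulation. This is a sketch under stated assumptions rather than the disclosed mechanism: `StorageNode` and `quorum_write` are hypothetical names, the nodes are plain Python objects instead of networked servers, and the writes are attempted one after another, so the acknowledgment is computed after all attempts rather than sent as soon as X replicas have succeeded.

```python
from dataclasses import dataclass, field


@dataclass
class StorageNode:
    """Hypothetical in-memory stand-in for a storage node."""
    name: str
    healthy: bool = True
    objects: dict = field(default_factory=dict)

    def store(self, key, data):
        """Store the data and report success; an unhealthy node fails."""
        if not self.healthy:
            return False
        self.objects[key] = data
        return True


def quorum_write(nodes, key, data, quorum):
    """Attempt the write on every replica (Y = len(nodes)).

    Returns (acknowledged, failed_names): acknowledged is True when at
    least `quorum` (X) replicas were written; failed_names lists the
    replicas that still need repair from a successfully written copy.
    """
    written, failed = [], []
    for node in nodes:
        (written if node.store(key, data) else failed).append(node.name)
    return len(written) >= quorum, failed


# A 3:5 policy in which the fifth replica fails during the write.
replicas = [StorageNode(f"node-{i}") for i in range(5)]
replicas[4].healthy = False
acknowledged, needs_repair = quorum_write(replicas, "object-a", b"payload", quorum=3)
```

In this run the client would be acknowledged because four of the five replicas were written, and the remaining replica is reported for later repair.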



FIG. 8 illustrates another aspect of the write concept. The client 502 sends a metadata request to the metadata server 504. The metadata server 504 returns the metadata to the client 502. The client 502 sends a write request utilizing the metadata, with the data. The example format is dest: storage node 1, dest: storage node 2, etc. The process includes writing the data to the destination node and returning an acknowledgment to the client 502.



FIG. 9 illustrates an example structure 900 for a read process. To read an object, the client retrieves metadata the same way as for a write request. The client 502 sends a read request to one of the storage nodes based on the metadata received from the metadata server 504. The read request is typically the most frequently used request. There may be an opportunity to load balance, either by having the client 502 select a random storage node among the storage nodes that contain the object or, on the metadata server side, by sending only partial metadata containing a subset of the list of storage nodes holding the object. In some cases, the client 502 can store the metadata for an object. If the client recently retrieved the object from storage and has the metadata in its cache, the client could simply submit a read request again to the storage node 506 without requesting the metadata from the metadata server 504.
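The client-side read path, with a metadata cache and random selection among the storage nodes holding the object, can be sketched as follows. The `ReadClient` class and the two callables it takes are hypothetical stand-ins for the client library and its network requests; cache invalidation is deliberately left out.

```python
import random


class ReadClient:
    """Hypothetical client-side read path with a metadata cache."""

    def __init__(self, fetch_metadata, read_from_node, rng=None):
        self._fetch_metadata = fetch_metadata  # name -> list of replica addresses
        self._read_from_node = read_from_node  # (address, name) -> object data
        self._cache = {}                       # name -> cached replica addresses
        self._rng = rng or random.Random()

    def read(self, name):
        replicas = self._cache.get(name)
        if replicas is None:
            # Cache miss: ask the metadata server, then remember the answer.
            replicas = self._fetch_metadata(name)
            self._cache[name] = replicas
        # Client-side load balancing: pick one replica at random.
        address = self._rng.choice(replicas)
        return self._read_from_node(address, name)


metadata_requests = []


def fake_fetch_metadata(name):
    metadata_requests.append(name)
    return ["2001:db8::1", "2001:db8::2", "2001:db8::3"]


def fake_read_from_node(address, name):
    return f"{name} served by {address}"


client = ReadClient(fake_fetch_metadata, fake_read_from_node)
first = client.read("object-a")
second = client.read("object-a")  # served from the metadata cache
```

The second read reaches a storage node directly, without a second metadata request.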



FIG. 10 illustrates the client 502 sending the metadata request to the metadata server 504 that returns the metadata or, as noted above, partial metadata containing a subset of the list of storage nodes that hold a particular object. Utilizing the metadata, the client 502 sends a read request to the storage node X, or multiple storage nodes, which return the object to the client.


Note that all these operations are transparent to the user application. A client library provides a regular object storage interface to applications using it. This library is configurable to allow for different policies regarding data safety, data and metadata placement, and so forth. The backend storage requires no configuration or change for it to provide different policies on this matter, which greatly simplifies the administration.



FIG. 11 illustrates an overall system 1100 which includes a logical view of the responsibilities of the various actors in the storage program. The client device 502 initiates the request to perform such functions as close, open, write, read, destroy. The client library associated with the client device 502 allows for customizable configuration, computes metadata hashes, interacts with metadata servers and storage nodes, and provides an object storage semantic interface to the client application. The metadata management occurs between the client device 502 and the metadata server 504. The metadata server 504 can be considered a request handler and creates the metadata, oversees the metadata servers and storage nodes repair when required, places data based on systemwide metrics and possibly client recommendations, and can perform load-balancing of data requests and access. The data placement and metadata update occurs between the metadata servers 504 and the storage nodes 506.


The storage nodes 506 represent request handlers that store the data and assign a unique IPv6 identifier to each content replica stored. The data writing and reading occurs between the client device 502 and the storage node 506.


One example of the system disclosed in FIG. 11 is as follows. A system can include at least one storage node and at least one metadata server, wherein the system is configured to communicate with a client device and the at least one storage node for managing a storing of objects. The at least one metadata server can be configured to receive a request to create metadata associated with an object to be stored, wherein the request comprises a computed metadata hash that is computed at the client device, create the metadata in response to the request and place the object for storage at the storage node based on at least one of system-wide metrics or a client recommendation. Other factors can be used as well for making storage placement decisions. Such can include, but are not limited to, one or more of quality of service requirements, access control lists, load-balancing, premium pricing, user priority, user profile data, data regarding performance of the storage system, and so forth. The at least one storage node can be configured to receive and store the object and assign a unique IPv6 identifier to each replica of the object.


Other characteristics of the storage system include the following features. The system is flexible. Different IPv6 prefixes can be assigned for different types of storage, such as flash, hard drive, and so forth. This effectively makes the nature of the performance of storage almost transparent to the system by just addressing different storage types by selecting different addresses. In a cloud scenario where there are multiple tenants, a metadata prefix can be assigned to each tenant, so that isolation is ensured and no collision happens between different users. Through the use of different prefixes, the system can also support different policies for replication, repair, load-balancing, data placement, ready partition of contents, and so forth. The hashing function used can vary to fit the properties that one requires for a specific application. For example, a hashing function can be designed to yield close hashes for objects with similar names or, conversely, to have a cascade effect, which means that two names that differ by only one bit can give completely different hashes. In another aspect, the flexibility of the system is provided by the metadata being customized. The metadata can contain enough information to locate the contents, but could also contain many other parameters or types of information, such as access control lists, size, duration of a video chunk, or any other type of object metadata.


The system disclosed herein is also not heavily layered. For example, the IPv6-centric design allows for the client to directly connect to the metadata servers and the storage nodes without the need for inter-node communication and a complex metadata maintenance process. The system map is the network itself and does not have to be consistently maintained across all nodes, or shared with clients, and so forth. The client accesses data much like it would when using a simple file system. The client first fetches metadata (a functional equivalent of data stored in the filesystem inodes) that gives the client information on where the data or the blocks are. In the file system case, the metadata stays totally hidden from the client.


Other benefits of the approach disclosed above include that it does not consume much bandwidth. The approach does not use much bandwidth because the bandwidth is almost exclusively dedicated to data transmission between clients and servers and to repairs, if need be. The only overhead is the very lightweight request protocol for signaling messages and metadata migration when new servers are added to the metadata server pool. This migration can be caused by the metadata system prefix being static, which means adding or removing metadata servers requires a change in the metadata servers' prefix allocation, which in turn leads to a migration to fit the new distribution. This is similar to the Ceph example, only the concept is metadata instead of data. Note that metadata are usually much more lightweight than the content itself, and the metadata can be as small as a few KB for a multi-GB object. This effectively prevents the kind of overhead that other systems suffer when adding a new server.


Another benefit of the approach is that it is easily manageable. Different policies can be defined for multiple aspects of the whole system and each one of them is associated with an IPv6 prefix. For example, one can imagine a system with two prefixes. One prefix would be dedicated to highly requested and often accessed objects that have a high replication factor. For example, for the sake of load-balancing and resiliency, such objects could be highly replicated. These objects are stored on expensive flash disks, or on other dedicated storage means, compared to regular objects with a smaller replication factor that are stored on traditional hard disks, corresponding to the other prefix. Another use could be to give a prefix to each data center and to force the system to store replicas of both metadata and data on different prefixes to achieve data center level resiliency. A further use case could be to provide different qualities of service to different types of users for a content delivery network (CDN): some premium range of users would have access to a prefix that has small-capacity caches (well distributed) that store high quality video content, while regular users would only be allowed to access medium quality content stored under another prefix namespace.


In another aspect, access to the content can be easily monitored by simple traffic inspection. Each IPv6 address corresponds to one content. Thus, obtaining valuable information about the storage system itself is a very simple and effective task. The use of the IPv6 structure (or similar structure) to address content is an effective feature of an example distributed storage system. The fact that a client request can be routed all the way down to the actual object means that there is no need for redundant communication between nodes to know how to reach any specific content. It also means that every existing Layer 3 tool can be used for different purposes. For example, segment routing can benefit the system for load-balancing between paths by just giving a client a segment routing list instead of just an address pointing to the content. Additionally, the system map is fully stored in the network itself, contrary to other storage systems, where different system maps have to be created, maintained, kept consistent between devices, regularly updated, distributed to every client, and so forth. In the system disclosed herein, the client library only has to know the IPv6 prefixes corresponding to different policies. Each such prefix is static, and each element of the underlying architecture is transparently addressed by the network. The addition or the removal of storage devices only translates into a few route changes in the datacenter routers and a light metadata rebalancing to fit the new architecture, which are operations the clients are oblivious to. Furthermore, aggregating statistics about network flows is easy and can be done in the network layer. For traditional distributed storage systems, the system itself has to support and integrate analytics tools for this purpose that can be complex and require additional resources on each node.


Another benefit of the approach disclosed herein is that it is easy to build upon. The design of the storage system allows for the incorporation of erasure coding techniques. Such techniques encode data in several fragments that are distributed amongst different nodes and have a non-integer replication overhead. One can typically achieve the same resiliency with a 1.4 ratio. For example, 14 encoded shards from data originally striped in 10 fragments present a better resiliency level than 3 stored replicas, provided they are stored on different storage nodes, effectively more than halving the storage overhead. One could incorporate the encoding information into the metadata of the disclosed storage system rather than just a replica's location. This comes, however, with a price in that the encoding utilized will consume computing resources, require complex metadata and add an encoding/decoding latency. Any traditional authentication method can also be implemented so that metadata servers verify the identity of clients before responding to any query.
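The storage overhead comparison between 14 shards encoded from 10 fragments and 3 full replicas can be worked through numerically. The calculation below assumes a maximum distance separable code (for example Reed-Solomon), in which any 10 of the 14 shards suffice to rebuild the data; that property is an assumption of this sketch and is not stated in the disclosure.

```python
def storage_overhead(stored_units, data_units):
    """Ratio of bytes stored to bytes of original data."""
    return stored_units / data_units


def tolerated_losses(stored_units, data_units):
    """Units that can be lost while the data stays recoverable (MDS assumption)."""
    return stored_units - data_units


erasure_overhead = storage_overhead(14, 10)     # 14 shards for 10 data fragments
replication_overhead = storage_overhead(3, 1)   # 3 full copies of the object

erasure_losses = tolerated_losses(14, 10)       # any 4 shards may be lost
replication_losses = tolerated_losses(3, 1)     # any 2 replicas may be lost
```

Under that assumption, the encoded layout stores 1.4 times the original data and survives the loss of 4 storage nodes, while triple replication stores 3 times the data and survives the loss of 2.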


The system can be an IPv6-centric distributed storage system that is built around a pool of organized metadata servers and a pool of storage nodes. Both pools are flat pools and do not follow a master/slave architecture, so that there is no artificial bottleneck or single point of failure (SPOF). The system can be globally fully distributed and resilient (no SPOF) and can support elaborate load-balancing policies as well as various deployment models. The system's applicability can range, without limitation, from a DC central storage system to a fully distributed storage suitable for IoE applications.



FIG. 12 presents an example object storage architecture 1200 which can be contrasted with the architecture 1300 shown in FIG. 13. In the fully distributed architecture 1300, there is no system-wide bottleneck, especially during metadata access (contrary to system 1202 for the map, or GFS for the single master access). The system 1202 includes monitors 1206 that maintain the cluster map, distribute the cluster map to clients, reach consensus through Paxos and send a fetched cluster map to the client 1204 each time it changes. In system 1202, a map has to identify every node and the monitors have to have knowledge of every node in the system. The more nodes in the system, the more complex the map becomes. In the system 1202, the client 1204 knows the monitors' addresses and knows the hashing function. The client 1204 writes and reads to the database 1210 and directly contacts the object storage devices (OSDs) following the current cluster map and the object hash. The OSDs 1210 store the objects, are organized in a cluster map, and deal with the replication issue. An administrator 1208 sets up the cluster and administrates the nodes.



FIG. 13 illustrates another aspect of the present disclosure. The general concept is to move everything regarding managing the storage of data into the network. Every part of the architecture 1300 can be upscaled by adding more dedicated servers 1310 without impacting the rest of the system. The storage nodes 1310 store the objects, are organized in a cluster map, and deal with replication. The client 1304 retrieves object metadata from the metadata nodes 1306, which store the object metadata, answer the requests addressed to a sub-prefix, and place data following a placement policy which can be dynamic. The client 1304 knows the metadata prefix and the hashing function. The client only needs to know the single metadata prefix to address the whole storage domain, which reduces the amount of data the client needs. The client 1304 writes and reads to the storage nodes 1310 and sends requests to IPv6 addresses stored in object metadata. An administrator 1308 sets up the cluster, maintains a cluster map, and administrates the nodes.


One difference from existing storage systems is that everything is identified through the IPv6 protocols or headers. This means that the data is accessible through the network. The client only needs to know the IPv6 prefix of the storage domain that the client links to. This is the main simplification: the client does not have to know what is behind the prefix. It could be one node or 100 nodes. The simplicity is achieved by using the pseudorandom function which hashes the object name and outputs an IPv6 address which has a prefix. When the client desires to write or retrieve an object, it only has to hash the object name, which gives a metadata address inside the metadata system prefix. The client sends the request to that address, which corresponds to a sub-prefix held by a metadata server. In FIG. 13, the “M/80” prefix would be split into three as there are three nodes shown (by way of example). Every node will have contiguous prefixes. To add a new metadata node and increase the size of the system, one splits the prefix again and gives a part of the prefix to the new metadata node, which does not change the metadata system prefix. The metadata system prefix is fixed. The metadata is very light, that is, a small amount of information. The metadata might be the size or the type of object. Thus, in FIG. 13, the prefix (M/80) is fixed and the information is simple, rather than a complex cluster map that has to be updated. Additionally, in FIG. 12, the system hashes the object name to find where to store the object. In FIG. 13, the system hashes the object name to find where to store the metadata. In FIG. 13, light metadata is rebalanced when a metadata server is added or removed, whereas in FIG. 12 the system rebalances the objects when an OSD is added or removed.
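The splitting of a fixed /80 metadata system prefix into contiguous sub-prefixes held by metadata nodes can be sketched with Python's standard `ipaddress` module. The concrete prefix, the node names, and the choice of one /81 plus two /82 sub-prefixes for three nodes are assumptions made for illustration; the disclosure does not specify how the three parts are sized.

```python
import ipaddress

# Hypothetical metadata system prefix "M/80" from the documentation range.
SYSTEM_PREFIX = ipaddress.IPv6Network("2001:db8:0:0:1::/80")


def initial_allocation(prefix):
    """Split the system prefix among three nodes: one /81 and two /82."""
    lower_half, upper_half = prefix.subnets(prefixlen_diff=1)
    third, fourth = upper_half.subnets(prefixlen_diff=1)
    return {lower_half: "meta-1", third: "meta-2", fourth: "meta-3"}


def add_node(allocation, donor_prefix, new_node):
    """Split one sub-prefix again and give its upper half to a new node."""
    kept, given = donor_prefix.subnets(prefixlen_diff=1)
    allocation[kept] = allocation.pop(donor_prefix)
    allocation[given] = new_node
    return given  # only metadata hashed into this range has to migrate


def owner_of(allocation, address):
    """Find the metadata node whose sub-prefix contains the address."""
    for sub_prefix, node in allocation.items():
        if address in sub_prefix:
            return node
    raise LookupError(f"{address} is outside the metadata system prefix")


allocation = initial_allocation(SYSTEM_PREFIX)
migrated = add_node(allocation, ipaddress.IPv6Network("2001:db8:0:0:1::/81"), "meta-4")
```

After the split, every sub-prefix still lies inside the unchanged system prefix, so clients keep hashing into the same /80 and only the routes toward the sub-prefixes change.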


Only metadata are redistributed when a metadata node is added or removed (unlike system 1202), which keeps the network usage overhead minimal during maintenance operations in the datacenter. The number of software layers is kept minimal (unlike other distributed storage systems). This means that the use of computing resources is kept minimal. The deliberately clean and simple design of the system, as well as its close integration with the network, means that there is no need to maintain different maps, distribute them amongst nodes and keep them consistent. The simplicity of the present system also allows for a highly flexible storage system that permits the administrator 1308 to very easily and transparently define totally different policies for different parts of the storage system. The IPv6-centric design of this system, in addition to allowing for the simplicity and flexibility already pointed out, makes the administration and analysis of the storage system very easy through the use of unique IPv6 addresses for content. All of these characteristics improve read/write performance and allow for high throughput, easy management and simple analytics gathering.


The system disclosed herein can be independent from the entities using it (the clients) and aims at storing very generic data that can range from very small to very large. It is not simply a mesh network of storage nodes. Furthermore, the disclosure proposes not only to identify data by their IPv6 identifiers, but also to classify storage classes by IPv6 prefixes, meaning that instead of combining just identification and location, the disclosure combines identification, location and the QoS required for the data. This is an important step in facilitating the maintenance and the organization of a generic distributed storage system, e.g. a data center.


Storing replicas on highly different locations ensures that a localized power failure or accident won't bring down all the available replicas. This is why the ability to define several metadata prefixes can be used to define different failure domains. The approaches disclosed herein can shift the complexity of maintaining a cluster map that has to be distributed to clients and kept self-coherent to the network. In some cases, only the orchestrator (that clients do not access) has to maintain a structure that allocates IPv6 prefixes to metadata nodes. Other than that, clients may only require limited, static information bits, like the system metadata IPv6 prefix or other configuration details, which typically may not change over time. Additionally, this information may not be ‘consciously’ owned or managed by the client since it may be a client library configuration parameter. The capacity to scale the metadata side of the storage system is advantageous because it allows much more flexibility in the storage system designs as well as for operating the system. It can almost indefinitely grow whereas other systems which have a single master node will always in the end be bounded by the capacity of the master node to deal with all incoming requests.


In some cases, the storage for metadata as well as actual client data can be scaled independently, to allow for widely different scenarios (few big objects, small metadata and few metadata servers; numerous small objects, comparable size metadata, and numerous metadata servers). With large amounts of data in the storage system, under the approach in FIG. 12, when one adds new nodes to the storage system, terabytes of data may need to be moved, which is very cumbersome. Thus, scaling and redistribution of data becomes problematic. IPv6 is used to provide more finely tuned control over the system. Different policies could be applied to different prefixes. An orchestrator can manage those policies for particular prefixes. Thus, a certain quality of service, or certain hardware profile, or geographic location, could be associated with a certain prefix or sub-prefix. Thus, when a client seeks the object metadata for accessing the object or writing the object, the orchestrator can apply the policy for that zone or that prefix. The whole storage system is managed by using different IPv6 prefixes and thus all of the complexity is within the network.


In one example, geographic control or policies can be implemented. For example, if an administrator wants storage nodes to be established across a particular geographic location, such as California and Alaska, the system can assign an IPv6 prefix or prefixes to storage nodes in those geographic locations. Once the assignments are made, policies can be established to route or distribute objects to be stored on those particular nodes in those geographic locations. Thus, through the assignment of addresses or IPv6 prefixes, one can manage the geographic topology of a network in an efficient manner. Another advantage is that identifying storage locations with IPv6 prefixes can also simplify the view of the network in that even if data is moved from one physical storage node to another, the logical view of the stored objects can remain the same. The physical location of the data does not matter.



FIG. 14 illustrates a method embodiment. The method embodiment includes receiving, at a computing device, a request from a client to create metadata associated with an object (1402), creating the metadata based on the request (1404) and transmitting the metadata and an acknowledgment to the client. The metadata can contain an address in a storage system for each replica of the object and can be used to write data to the storage system and read the data from the storage system (1406). The system can open up a connection between the client 502 and the metadata server 504 according to the address. If the node 504 is down, there are metadata replicas (for replication as well as load balancing), so the client can make another request by computing another metadata hash. With the created metadata from the metadata server, a connection can be established between the client 502 and the storage node 506 for writing or retrieving data. The storage node 506 will, in a write scenario, assign a unique IPv6 identifier to each version of the object stored. If the operation is a read operation, the storage node 506 will provide the data to the client device 502.
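The fallback to another metadata replica when a metadata server is down can be sketched on the client side as a loop over the seeded hash family. As before, this is a minimal sketch under assumptions: SHA-256 stands in for the pseudorandom function, the /80 prefix is taken from the documentation range, and `request` is a hypothetical callable representing the network request, which is expected to raise `ConnectionError` when the addressed server does not answer.

```python
import hashlib
import ipaddress

# Hypothetical metadata system prefix from the IPv6 documentation range.
METADATA_PREFIX = ipaddress.IPv6Network("2001:db8:0:0:1::/80")


def metadata_address(name, seed):
    """Seeded hash of the object name, mapped into the metadata prefix."""
    host_bits = 128 - METADATA_PREFIX.prefixlen
    digest = hashlib.sha256(f"{seed}:{name}".encode("utf-8")).digest()
    suffix = int.from_bytes(digest, "big") & ((1 << host_bits) - 1)
    return ipaddress.IPv6Address(int(METADATA_PREFIX.network_address) + suffix)


def fetch_metadata(name, request, replicas=3):
    """Try each metadata replica address in turn until one server answers."""
    for seed in range(replicas):
        address = metadata_address(name, seed)
        try:
            return request(address, name)
        except ConnectionError:
            continue  # server down: compute the next hash and retry
    raise ConnectionError(f"no metadata replica reachable for {name!r}")


# Simulated network in which the server holding the first replica is down.
unreachable = {metadata_address("report.pdf", 0)}
attempts = []


def fake_request(address, name):
    attempts.append(address)
    if address in unreachable:
        raise ConnectionError("metadata server down")
    return {"name": name, "replicas": ["2001:db8:0:0:2::1"]}


metadata = fetch_metadata("report.pdf", fake_request)
```

The client needs no knowledge of which server failed; it only moves on to the next address in the hash family.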


No filesystem layer is required between the application layer and the storage system. The storage system contains the pool of metadata servers and the pool of storage servers. Writing and reading the data from the storage system can be accomplished via an IPv6 address stored in the metadata. IPv6 prefixes can be used to represent a group of addresses or can be used to provide tailored writing or reading to or from the storage system according to a policy. In another aspect, a metadata prefix can be assigned to each tenant in a multi-tenant environment. The client can compute a family of pseudorandom seeded hashes based on an object name or consecutive integers as seeds. The client can also compute a family of pseudorandom seeded X-bit hashes based on an object name, wherein X is less than or equal to 128. In another aspect, the metadata, when used to write the data to the storage system, is utilized to write replica data to the storage system. For example, the metadata can include information for writing replica data, including identification information for the replica data and storage system.


The method can further include, by the computing device, determining where to store the data on the storage system based on a placement policy, system-wide metrics, client recommendations, and/or quality of service requirements. Other factors can also be considered for determining where to store data, such as state information, conditions, statistics, preferences, data characteristics, etc. The method can also assign an IPv6 address to the data, which can identify the data and the location of the data. As previously explained, the method can also use prefixes, such as IPv6 prefixes for storing, maintaining, identifying, and/or classifying metadata, metadata servers, data storage nodes, objects, data characteristics, data requirements, tenants, etc.


The distributed storage system above can be described as a native object storage system which behaves as a block storage system when the objects all have the same size.


In some embodiments the computer-readable storage devices, mediums, and memories can include a cable or wireless signal containing a bit stream and the like. However, when mentioned, non-transitory computer-readable storage media expressly exclude media such as energy, carrier signals, electromagnetic waves, and signals per se.


Methods according to the above-described examples can be implemented using computer-executable instructions that are stored or otherwise available from computer readable media. Such instructions can comprise, for example, instructions and data which cause or otherwise configure a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions. Portions of computer resources used can be accessible over a network. The computer executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, firmware, or source code. Examples of computer-readable media that may be used to store instructions, information used, and/or information created during methods according to described examples include magnetic or optical disks, flash memory, USB devices provided with non-volatile memory, networked storage devices, and so on.


Devices implementing methods according to these disclosures can comprise hardware, firmware and/or software, and can take any of a variety of form factors. Typical examples of such form factors include laptops, smart phones, small form factor personal computers, personal digital assistants, rackmount devices, standalone devices, and so on. Functionality described herein also can be embodied in peripherals or add-in cards. Such functionality can also be implemented on a circuit board among different chips or different processes executing in a single device, by way of further example.


The instructions, media for conveying such instructions, computing resources for executing them, and other structures for supporting such computing resources are means for providing the functions described in these disclosures.


Although a variety of examples and other information were used to explain aspects within the scope of the appended claims, no limitation of the claims should be implied based on particular features or arrangements in such examples, as one of ordinary skill would be able to use these examples to derive a wide variety of implementations. Further, although some subject matter may have been described in language specific to examples of structural features and/or method steps, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to these described features or acts. For example, such functionality can be distributed differently or performed in components other than those identified herein. Rather, the described features and steps are disclosed as examples of components of systems and methods within the scope of the appended claims. Moreover, claim language reciting “at least one of” a set indicates that one member of the set or multiple members of the set satisfy the claim.


It should be understood that features or configurations described herein with reference to one embodiment or example can be implemented in, or combined with, other embodiments or examples herein. That is, terms such as “embodiment”, “variation”, “aspect”, “example”, “configuration”, “implementation”, “case”, and any other terms which may connote an embodiment, as used herein to describe specific features or configurations, are not intended to limit any of the associated features or configurations to a specific or separate embodiment or embodiments, and should not be interpreted to suggest that such features or configurations cannot be combined with features or configurations described with reference to other embodiments, variations, aspects, examples, configurations, implementations, cases, and so forth. In other words, features described herein with reference to a specific example (e.g., embodiment, variation, aspect, configuration, implementation, case, etc.) can be combined with features described with reference to another example. Specifically, one of ordinary skill in the art will readily recognize that the various embodiments or examples described herein, and their associated features, can be combined with each other.


A phrase such as an “aspect” does not imply that such aspect is essential to the subject technology or that such aspect applies to all configurations of the subject technology. A disclosure relating to an aspect may apply to all configurations, or one or more configurations. A phrase such as an aspect may refer to one or more aspects and vice versa. A phrase such as a “configuration” does not imply that such configuration is essential to the subject technology or that such configuration applies to all configurations of the subject technology. A disclosure relating to a configuration may apply to all configurations, or one or more configurations. A phrase such as a configuration may refer to one or more configurations and vice versa. The word “exemplary” is used herein to mean “serving as an example or illustration.” Any aspect or design described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects or designs.


Moreover, claim language reciting “at least one of” a set indicates that one member of the set or multiple members of the set satisfy the claim. For example, claim language reciting “at least one of A, B, and C” or “at least one of A, B, or C” means A alone, B alone, C alone, A and B together, A and C together, B and C together, or A, B and C together.
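By way of illustration only, the combinations listed above are the non-empty subsets of the set. The following minimal sketch enumerates them for a three-member set; the function name `satisfying_combinations` is hypothetical and is not part of the disclosure.

```python
from itertools import combinations


def satisfying_combinations(members):
    """Return every non-empty combination of members, smallest first."""
    result = []
    for size in range(1, len(members) + 1):
        # combinations() yields tuples in the order the members were given.
        result.extend(combinations(members, size))
    return result


combos = satisfying_combinations(["A", "B", "C"])
for combo in combos:
    print(" and ".join(combo))
```

For a set of n members the enumeration yields 2^n - 1 combinations, which is seven for A, B, and C.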

Claims
  • 1. A method comprising: receiving, at a computing device, a request to create metadata associated with an object from a client; creating the metadata based on the request; and transmitting the metadata and an acknowledgment to the client; wherein the metadata contains an address in a storage system for each replica of the object and wherein the metadata can be used to write data to the storage system and read the data from the storage system; wherein the metadata comprises an IPv6 prefix for a group of IPv6 addresses which are assigned to the replicas and the object.
  • 2. The method of claim 1, wherein writing and reading the data from the storage system is accomplished via an IPv6 address stored in the metadata.
  • 3. The method of claim 1, wherein the client computes a family of pseudorandom seeded hashes based on at least one of an object name and consecutive integers as seeds.
  • 4. The method of claim 1, wherein the client computes a family of pseudorandom seeded X-bit hashes based on an object name, wherein X is less than or equal to 128.
  • 5. The method of claim 1, wherein the metadata, when used to write the data to the storage system, is utilized to write replica data to the storage system.
  • 6. The method of claim 1, further comprising, by the computing device, determining where to store the data on the storage system based on one or more of a placement policy, system-wide metrics, a client recommendation, and quality of service requirements.
  • 7. The method of claim 1, wherein a metadata prefix is assigned to each tenant in a multi-tenant environment.
  • 8. The method of claim 1, wherein no filesystem layer exists between an application layer and a storage system layer.
  • 9. A non-transitory computer-readable storage device storing instructions which, when executed by at least one processor, cause the at least one processor to perform operations comprising: establishing a static rule to distribute a first flow to a first server and a second flow to a second server; receiving a request to create metadata associated with an object from a client; creating the metadata based on the request, wherein the metadata comprises an address for each replica of the object; and transmitting the metadata and an acknowledgment to the client, wherein the metadata contains the address in a storage system for each replica of the object and wherein the metadata can be used to write data to the storage system and read the data from the storage system; wherein the metadata comprises an IPv6 prefix for a group of IPv6 addresses which are assigned to the replicas and the object.
  • 10. The non-transitory computer-readable storage device of claim 9, wherein the address is an IPv6 address, and wherein writing and reading the data from the storage system is accomplished via the IPv6 address stored in the metadata.
  • 11. The non-transitory computer-readable storage device of claim 9, wherein the client computes a family of pseudorandom seeded hashes based on at least one of an object name and consecutive integers as seeds.
  • 12. The non-transitory computer-readable storage device of claim 9, wherein the client computes a family of pseudorandom seeded X-bit hashes based on an object name, wherein X is less than or equal to 128.
  • 13. The non-transitory computer-readable storage device of claim 9, wherein the non-transitory computer-readable storage device stores further instructions which, when executed by the at least one processor, cause the at least one processor to perform further operations comprising: determining where to store the data on the storage system based on one or more of a placement policy, system-wide metrics, a client recommendation, and quality of service requirements.
  • 14. A system comprising: at least one non-transitory computer readable medium storing instructions: at least one processor programmed to cooperate with the instructions to perform operations comprising: receiving, at a computing device, a request to create metadata associated with an object from a client; creating the metadata based on the request; and transmitting the metadata and an acknowledgment to the client; wherein the metadata contains an address in a storage system for each replica of the object and wherein the metadata can be used to write data to the storage system and read the data from the storage system; wherein the metadata comprises an IPv6 prefix for a group of IPv6 addresses which are assigned to the replicas and the object.
  • 15. The system of claim 14, wherein writing and reading the data from the storage system is accomplished via an IPv6 address stored in the metadata.
  • 16. The system of claim 14, wherein the client computes a family of pseudorandom seeded hashes based on at least one of an object name and consecutive integers as seeds.
  • 17. The system of claim 14, wherein the client computes a family of pseudorandom seeded X-bit hashes based on an object name, wherein X is less than or equal to 128.
  • 18. The system of claim 14, wherein the metadata, when used to write the data to the storage system, is utilized to write replica data to the storage system.
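By way of a non-limiting illustration, the seeded hash family and the IPv6 prefix recited in the claims above might be sketched as follows. Everything specific here is an assumption made for the example: the claims do not name a hash function, so SHA-256 over a "seed:name" string stands in for it; the function names `seeded_hash` and `replica_addresses` are hypothetical; the prefix `2001:db8::/64` is the reserved documentation prefix; and using one hash per replica to fill the host bits under the prefix is only one possible way to combine the two ideas, not a method confirmed by the claims.

```python
import hashlib
import ipaddress


def seeded_hash(object_name: str, seed: int, bits: int = 64) -> int:
    """Return a `bits`-wide integer hash of object_name for a given seed."""
    if not 1 <= bits <= 128:
        raise ValueError("bits must be between 1 and 128")
    # Assumption: SHA-256 of "seed:name" stands in for the unspecified hash.
    digest = hashlib.sha256(f"{seed}:{object_name}".encode("utf-8")).digest()
    # Keep only the top `bits` bits of the 256-bit digest.
    return int.from_bytes(digest, "big") >> (256 - bits)


def replica_addresses(prefix: str, object_name: str, replicas: int):
    """Derive one IPv6 address per replica under a shared prefix."""
    network = ipaddress.IPv6Network(prefix)
    host_bits = 128 - network.prefixlen
    base = int(network.network_address)
    # Consecutive integers 0, 1, 2, ... serve as the seeds.
    return [
        ipaddress.IPv6Address(base | seeded_hash(object_name, seed, host_bits))
        for seed in range(replicas)
    ]


addresses = replica_addresses("2001:db8::/64", "photo.jpg", 3)
for address in addresses:
    print(address)
```

Because the hashes are deterministic, any client that knows the object name, the seeds, and the prefix would derive the same group of addresses in this sketch.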
Non-Patent Literature Citations (83)
Entry
Author Unknown, “5 Benefits of a Storage Gateway in the Cloud,” Blog, TwinStrata, Inc., posted Jul. 10, 2012, 4 pages, https://web.archive.org/web/20120725092619/http://blog.twinstrata.com/2012/07/10//5-benefits-of-a-storage-gateway-in-the-cloud.
Author Unknown, “Configuration Interface for IBM System Storage DS5000, IBM DS4000, and IBM DS3000 Systems,” IBM SAN Volume Controller Version 7.1, IBM® System Storage® SAN Volume Controller Information Center, Jun. 16, 2013, 3 pages.
Author Unknown, “Coraid EtherCloud, Software-Defined Storage with Scale-Out Infrastructure,” Solution Brief, 2013, 2 pages, Coraid, Redwood City, California, U.S.A.
Author Unknown, “Coraid Virtual DAS (VDAS) Technology: Eliminate Tradeoffs between DAS and Networked Storage,” Coraid Technology Brief, © 2013 Cora id, Inc., Published on or about Mar. 20, 2013, 2 pages.
Author Unknown, “Creating Performance-based SAN SLAs Using Finisar's NetWisdom” May 2006, 7 pages, Finisar Corporation, Sunnyvale, California, U.S.A.
Author Unknown, “Data Center, Metro Cloud Connectivity: Integrated Metro SAN Connectivity in 16 Gbps Switches,” Brocade Communication Systems, Inc., Apr. 2011, 14 pages.
Author Unknown, “Data Center, SAN Fabric Administration Best Practices Guide, Support Perspective,” Brocade Communication Systems, Inc., May 2013, 21 pages.
Author Unknown, “delphi—Save a CRC value in a file, without altering the actual CRC Checksum?” Stack Overflow, stackoverflow.com, Dec. 23, 2011, XP055130879, 3 pages http://stackoverflow.com/questions/8608219/save-a-crc-value-in-a-file-without-altering-the-actual-crc-checksum.
Author Unknown, “EMC UNISPHERE: Innovative Approach to Managing Low-End and Midrange Storage; Redefining Simplicity in the Entry-Level and Midrange Storage Markets,” Data Sheet, EMC Corporation; published on or about Jan. 4, 2013 [Retrieved and printed Sep. 12, 2013] 6 pages http://www.emc.com/storage/vnx/unisphere.htm.
Author Unknown, “HP XP Array Manager Software—Overview & Features,” Storage Device Management Software; Hewlett-Packard Development Company, 3 pages; © 2013 Hewlett-Packard Development Company, L.P.
Author Unknown, “Joint Cisco and VMWare Solution for Optimizing Virtual Desktop Delivery: Data Center 3.0: Solutions to Accelerate Data Center Virtualization,” Cisco Systems, Inc. and VMware, Inc., Sep. 2008, 10 pages.
Author Unknown, “Network Transformation with Software-Defined Networking and Ethernet Fabrics,” Positioning Paper, 2012, 6 pages, Brocade Communications Systems.
Author Unknown, “Recreating Real Application Traffic in Junosphere Lab,” Solution Brief, Juniper Networks, Dec. 2011, 3 pages.
Author Unknown, “Shunra for HP Softwarer,” Enabiling Confidence in Application Performance Before Deployment, 2010, 2 pages.
Author Unknown, “Software Defined Networking: The New Norm for Networks,” White Paper, Open Networking Foundation, Apr. 13, 2012, 12 pages.
Author Unknown, “Software Defined Storage Networks an Introduction,” White Paper, Doc # 01-000030-001 Rev. A, Dec. 12, 2012, 8 pages; Jeda Networks, Newport Beach, California, U.S.A.
Author Unknown, “Standard RAID Levels,” Wikipedia, the Free Encyclopedia, last updated Jul. 18, 2014, 7 pages; http://en.wikipedia.org/wiki/Standard_RAID_levels.
Author Unknown, “Storage Infrastructure for the Cloud,” Solution Brief, © 2012, 3 pages; coraid, Redwood City, California, U.S.A.
Author Unknown, “Storage Area Network—NPIV: Emulex Virtual HBA and Brocade, Proven Interoperability and Proven Solution,” Technical Brief, Apr. 2008, 4 pages, Emulex and Brocade Communications Systems.
Author Unknown, “The Fundamentals of Software-Defined Storage, Simplicity at Scale for Cloud-Architectures” Solution Brief, 2013, 3 pages; Coraid, Redwood City, California, U.S.A.
Author Unknown, “VirtualWisdom® SAN Performance Probe Family Models: Probe FC8, HD, and HD48,” Virtual Instruments Data Sheet, Apr. 2014 Virtual Instruments. All Rights Reserved; 4 pages.
Author Unknown, “Xgig Analyzer: Quick Start Feature Guide 4.0,” Feb. 2008, 24 pages, Finisar Corporation, Sunnyvale, California, U.S.A.
Author Unknown, “Sun Storage Common Array Manager Installation and Setup Guide,” Software Installation and Setup Guide Version 6.7.x 821-1362-10, Appendix D: Configuring In-Band Management, Sun Oracle; retrieved and printed Sep. 12, 2013, 15 pages.
Author Unknown, “Vblock Solution for SAP: Simplified Provisioning for Operation Efficiency,” VCE White Paper, VCE—The Virtual Computing Environment Company, Aug. 2011, 11 pages.
Berman, Stuart, et al., “Start-Up Jeda Networks in Software Defined Storage Network Technology,” Press Release, Feb. 25, 2013, 2 pages, http://www.storagenewsletter.com/news/startups/jeda-networks.
Borovick, Lucinda, et al., “White Paper, Architecting the Network for the Cloud,” IDC Analyze the Future, Jan. 2011, pp. 1-8.
Chakrabarti, Kaushik, et al., “Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases,” ACM Transactions on Database Systems, vol. 27, No. 2, Jun. 2009, pp. 188-228.
Chandola, Varun, et al., “A Gaussian Process Based Online Change Detection Algorithm for Monitoring Periodic Time Series,” Proceedings of the Eleventh SIAM International Conference on Data Mining, SDM 2011, Apr. 28-30, 2011, 12 pages.
Cisco Systems, Inc., “N-Port Virtualization in the Data Center,” Cisco White Paper, Cisco Systems, Inc., Mar. 2008, 7 pages.
Cisco Systems, Inc., “Best Practices in Deploying Cisco Nexus 1000V Series Switches on Cisco UCS B and C Series Cisco UCS Manager Servers,” White Paper, Cisco Systems, Inc., Apr. 2011, 36 pages.
Cisco Systems, Inc., “Cisco Prime Data Center Network Manager 6.1,” At-A-Glance, © 2012, 3 pages.
Cisco Systems, Inc., “Cisco Prime Data Center Network Manager,” Release 6.1 Data Sheet, © 2012, 10 pages.
Cisco Systems, Inc., “Cisco Unified Network Services: Overcome Obstacles to Cloud-Ready Deployments,” White Paper, Cisco Systems, Inc., Jan. 2011, 6 pages.
Clarke, Alan, et al., “Open Data Center Alliance Usage: Virtual Machine (VM) Interoperability in a Hybrid Cloud Environment Rev. 1.2,” Open Data Center Alliance, Inc., 2013, pp. 1-18.
Cummings, Roger, et al., Fibre Channel—Fabric Generic Requirements (FC-FG), Dec. 4, 1996, 33 pages, American National Standards Institute, Inc., New York, New York, U.S.A.
Farber, Franz, et al., “An In-Memory Database System for Multi-Tenant Applications,” Proceedings of 14th Business, Technology and Web (BTW) Conference on Database Systems for Business, Technology, and Web, Feb. 28-Mar. 4, 2011, 17 pages, University of Kaiserslautern, Germany.
Guo, Chang Jie, et al., “IBM Research Report: Data Integration and Composite Business Services, Part 3, Building a Multi-Tenant Data Tier with with [sic] Access Control and Security,” RC24426 (C0711-037), Nov. 19, 2007, 20 pages, IBM.
Hatzieleftheriou, Andromachi, et al., “Host-side Filesystem Journaling for Durable Shared Storage,” 13th USENIX Conference on File and Storage Technologies (FAST '15), Feb. 16-19, 2015, 9 pages; http://www.usenix.org/system/files/conference/fast15/fast15-paper-hatzieleftheriou.pdf.
Hedayat, K., et al., “A Two-Way Active Measurement Protocol (TWAMP),” Network Working Group, RFC 5357, Oct. 2008, 26 pages.
Horn, C., et al., “Online anomaly detection with expert system feedback in social networks,” 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 22-27, 2011, 2 pages, Prague; [Abstract only].
Hosterman, Cody, et al., “Using EMC Symmetrix Storage in VMware vSphere Environments,” Version 8.0, EMC2 Techbooks, EMC Corporation; published on or about Jul. 8, 2008, 314 pages; [Retrieved and printed Sep. 12, 2013].
Hu, Yuchong, et al., “Cooperative Recovery of Distributed Storage Systems from Multiple Losses with Network Coding,” University of Science & Technology of China, Feb. 2010, 9 pages.
Keogh, Eamonn, et al., “Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases,” KAIS Long Paper submitted May 16, 2000; 19 pages.
Kolyshkin, Kirill, “Virtualization in Linux,” Sep. 1, 2006, pp. 1-5.
Kovar, Joseph F., “Startup Jeda Networks Takes SDN Approach to Storage Networks,” CRN Press Release, Feb. 22, 2013, 1 page, http://www.crn.com/240149244/printablearticle.htm.
Lampson, Butler, W., et al., “Crash Recovery in a Distributed Data Storage System,” Jun. 1, 1979, 28 pages.
Lewis, Michael E., et al., “Design of an Advanced Development Model Optical Disk-Based Redundant Array of Independent Disks (RAID) High Speed Mass Storage Subsystem,” Final Technical Report, Oct. 1997, pp. 1-211.
Lin, Jessica, “Finding Motifs in Time Series,” SIGKDD'02 Jul. 23-26, 2002, 11 pages, Edmonton, Alberta, Canada.
Linthicum, David, “VM Import could be a game changer for hybrid clouds,” InfoWorld, Dec. 23, 2010, 4 pages.
Long, Abraham Jr., “Modeling the Reliability of RAID Sets,” Dell Power Solutions, May 2008, 4 pages.
Ma, Ao, et al., “RAIDShield: Characterizing, Monitoring, and Proactively Protecting Against Disk Failures,” FAST '15, 13th USENIX Conference on File and Storage Technologies, Feb. 16-19, 2015, 17 pages, Santa Clara, California, U.S.A.
Mahalingam, M., et al., “Virtual eXtensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks,” Independent Submission, RFC 7348, Aug. 2014, 22 pages; http://www.hip.at/doc/rfc/rfc7348.html.
McQuerry, Steve, “Cisco UCS M-Series Modular Servers for Cloud-Scale Workloads,” White Paper, Cisco Systems, Inc., Sep. 2014, 11 pages.
Monia, Charles, et al., IFCP—A Protocol for Internet Fibre Channel Networking, draft-monia-ips-ifcp-00.txt, Dec. 12, 2000, 6 pages.
Mueen, Abdullah, et al., “Online Discovery and Maintenance of Time Series Motifs,” KDD'10 The 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Jul. 25-28, 2010, 10 pages, Washington, DC, U.S.A.
Muglia, Bob, “Decoding SDN,” Jan. 14, 2013, Juniper Networks, pp. 1-7, http://forums.juniper.net/15/The-New-Network/Decoding-SDN/ba-p/174651.
Murray, Joseph F., et al., “Machine Learning Methods for Predicting Failures in Hard Drives: A Multiple-Instance Application,” Journal of Machine Learning Research 6 (2005), pp. 783-816; May 2005, 34 pages.
Nelson, Mark, “File Verification Using CRC,” Dr. Dobb's Journal, May 1, 1992, pp. 1-18, XP055130883.
Pace, Alberto, “Technologies for Large Data Management in Scientific Computing,” International Journal of Modern Physics C., vol. 25, No. 2, Feb. 2014, 72 pages.
Pinheiro, Eduardo, et al., “Failure Trends in a Large Disk Drive Population,” FAST '07, 5th USENIX Conference on File and Storage Technologies, Feb. 13-16, 2007, 13 pages, San Jose, California, U.S.A.
Raginsky, Maxim, et al., “Sequential Anomaly Detection in the Presence of Noise and Limited Feedback,” arXiv:0911.2904v4 [cs.LG] Mar. 13, 2012, 19 pages.
Saidi, Ali G., et al., “Performance Validation of Network-Intensive Workloads on a Full-System Simulator,” Interaction between Operating System and Computer Architecture Workshop, (IOSCA 2005), Austin, Texas, Oct. 2005, 10 pages.
Sajassi, A., et al., “BGP MPLS Based Ethernet VPN,” Network Working Group, Oct. 18, 2014, 52 pages.
Sajassi, Ali, et al., “A Network Virtualization Overlay Solution using EVPN,” L2VPN Workgroup, Nov. 10, 2014, 24 pages; http://tools.ietf.org/pdf/draft-ietf-bess-evpn-overlay-00.pdf.
Sajassi, Ali, et al., “Integrated Routing and Bridging in EVPN,” L2VPN Workgroup, Nov. 11, 2014, 26 pages; http://tools.ietf.org/pdf/draft-ietf-bess-evpn-inter-subnet-forwarding-00.pdf.
Schroeder, Bianca, et al., “Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?” FAST '07: 5th USENIX Conference on File and Storage Technologies, Feb. 13-16, 2007, 16 pages, San Jose, California, U.S.A.
Shue, David, et al., “Performance Isolation and Fairness for Multi-Tenant Cloud Storage,” USENIX Association, 10th USENIX Symposium on Operating Systems Design Implementation (OSDI '12), 2012, 14 pages; https://www.usenix.org/system/files/conference/osdi12/osdi12-final-215.pdf.
Staimer, Marc, “Inside Cisco Systems' Unified Computing System,” Dragon Slayer Consulting, Jul. 2009, 5 pages.
Swami, Vijay, “Simplifying SAN Management for VMWare Boot from SAN, Utilizing Cisco UCS and Palo,” posted May 31, 2011, 6 pages.
Tate, Jon, et al., “Introduction to Storage Area Networks and System Networking,” Dec. 2017, 302 pages, ibm.com/redbooks.
Vuppala, Vibhavasu, et al., “Layer-3 Switching Using Virtual Network Ports,” Computer Communications and Networks, 1999, Proceedings, Eighth International Conference in Boston, MA, USA, Oct. 11-13, 1999, Piscataway, NJ, USA, IEEE, ISBN: 0-7803-5794-9, pp. 642-648.
Wang, Feng, et al., “OBFS: A File System for Object-Based Storage Devices,” Storage System Research Center, MSST, vol. 4, Apr. 2004, 18 pages.
Weil, Sage A., “CEPH: Reliable, Scalable, and High-Performance Distributed Storage,” Dec. 2007, 239 pages, University of California, Santa Cruz.
Weil, Sage A., et al., “CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data,” Proceedings of the 2006 ACM/IEEE conference on Supercomputing, ACM, Nov. 11, 2006, 12 pages.
Weil, Sage A., et al., “Ceph: A Scalable, High-performance Distributed File System,” Proceedings of the 7th symposium on Operating systems design and implementation, USENIX Association, Nov. 6, 2006, 14 pages.
Wu, Joel, et al., “The Design, and Implementation of AQuA: An Adaptive Quality of Service Aware Object-Based Storage Device,” Department of Computer Science, MSST, May 17, 2006, 25 pages; http://storageconference.us/2006/Presentations/30Wu.pdf.
Xue, Chendi, et al., “A Standard framework for Ceph performance profiling with latency breakdown,” CEPH, Jun. 30, 2015, 3 pages.
Zhou, Zihan, et al., “Stable Principal Component Pursuit,” arXiv:1001.2363v1 [cs.IT], Jan. 14, 2010, 5 pages.
Zhu, Yunfeng, et al., “A Cost-based Heterogeneous Recovery Scheme for Distributed Storage Systems with RAID-6 Codes,” University of Science & Technology of China, 2012, 12 pages.
Extended European Search Report dated Jun. 8, 2018, 6 pages, from the European Patent Office for corresponding EP Application No. 18150944.9.
Stamey, John, et al., “Client-Side Dynamic Metadata in Web 2.0,” SIGDOC '07, Oct. 22-24, 2007, pp. 155-161.
Aweya, James, et al., “Multi-level active queue management with dynamic thresholds,” Elsevier, Computer Communications 25 (2002) pp. 756-771.
Petersen, Chris, “Introducing Lightning: A flexible NVMe JBOF,” Mar. 9, 2016, 6 pages.
Related Publications (1)
Number Date Country
20180203866 A1 Jul 2018 US