A storage arrangement can include a cluster of computer nodes that manage access of data in a shared storage system that is shared by the cluster of computer nodes. Each computer node of the cluster of computer nodes can execute one or more virtual processors, with each virtual processor managing access to a respective data portion in the shared storage system.
Some implementations of the present disclosure are described with respect to the following figures.
Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements. The figures are not necessarily to scale, and the size of some parts may be exaggerated to more clearly illustrate the example shown. Moreover, the drawings provide examples and/or implementations consistent with the description; however, the description is not limited to the examples and/or implementations provided in the drawings.
A “virtual processor” can refer to a computing entity implemented with machine-readable instructions that are executable on a computer node. By managing access of different data portions using respective different virtual processors executed in a cluster of computer nodes, data throughput can be improved when the virtual processors access data in parallel from a shared storage system. A “shared” storage system is a storage system that is shared (i.e., accessible) by any computer node of the cluster of computer nodes.
A data access request (write request or read request) can be received at a given computer node of the cluster of computer nodes. A virtual processor in the given computer node can be assigned to handle the data access request. The virtual processor assigned to handle the data access request may be referred to as a “source virtual processor” with respect to the data access request. In some examples, the source virtual processor may not be the virtual processor that “owns” (i.e., manages access and/or updates to) metadata for the data that is the subject of the data access request. The virtual processor that owns metadata for data that is the subject of a given data access request may be referred to as a “metadata virtual processor” with respect to the given data access request. A virtual processor may also “own” a data object; such a virtual processor is responsible for managing the access and/or updates of the data object.
To process a data access request (such as a request to access a data object), the source virtual processor determines which virtual processor is the metadata virtual processor, and obtains the metadata from the metadata virtual processor. An example of the metadata may include a list of chunk identifiers of chunks that make up the data object. The list of chunk identifiers can be used by the source virtual processor to retrieve the chunks of the data object. Another example of the metadata can include a version of the data object.
For load balancing and improved throughput, metadata for respective data objects can be partitioned into multiple partitions that are spread across multiple virtual processors executing in a cluster of computer nodes. Each computer node can execute one or more virtual processors, and each virtual processor may include one or more partitions. A virtual processor “including” a partition can refer to the virtual processor owning a portion of the metadata (referred to as “metadata portion”) in the partition.
For load balancing reasons and to address faults or failures in computer nodes of the cluster of computer nodes, a virtual processor may be migrated from a source computer node to a target computer node. However, virtual processor migration can lead to processing overhead related to maintaining associations of metadata portions with respective virtual processors. An association between a metadata portion and a given virtual processor can be represented by mapping information. If the associations between metadata portions and virtual processors are not properly maintained in response to migrations of virtual processors, then source virtual processors may have difficulty finding metadata virtual processors when handling incoming data access requests.
Also, in examples where data is stored in the shared storage system in the form of data buckets, a further challenge relates to how metadata for the data buckets is sharded across the cluster of computer nodes as the cluster changes over time (such as due to additions of computer nodes to the cluster).
In accordance with some implementations of the present disclosure, a mapping scheme is provided that uses a partition map and a virtual processor-computer node (VP-CN) map. The partition map associates (maps) partitions (that include metadata portions) to respective virtual processors. The VP-CN map associates (maps) virtual processors to respective computer nodes of the cluster of computer nodes. When a virtual processor is migrated between computer nodes, the VP-CN map is updated, but the partition map does not change. As a result, requests for data objects that have keys that map to a given virtual processor would continue to map to the given virtual processor after the migration of the given virtual processor between different computer nodes. The static nature of the partition map in the context of virtual processor migrations allows a system including the cluster of computer nodes to deterministically map metadata of data objects to virtual processors.
In addition, data buckets can be received at different times. A first data bucket may be generated or received when the cluster of computer nodes has a first quantity of computer nodes. Later, one or more computer nodes may be added to the cluster. Even though the cluster of computer nodes has been expanded, the partition map and the VP-CN map for the first data bucket are not changed, which avoids the burden associated with having to update mappings as the cluster of computer nodes expands. If a second data bucket is generated or received after the expansion of the cluster of computer nodes, a partition map may be created for the second data bucket that makes use of the increased quantity of computer nodes. Note that there may be one partition map per data bucket (so that there are multiple partition maps for respective data buckets). However, there is one VP-CN map that is common to multiple data buckets.
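As a non-limiting illustration, the mapping scheme can be sketched in Python as follows; the names used (partition_maps, vp_cn_map, and the helper functions) are hypothetical and are not part of any particular implementation. The sketch assumes one partition map per data bucket and a single VP-CN map shared by all data buckets, and it shows that migrating a virtual processor updates only the VP-CN map.

    # Hypothetical sketch of the mapping scheme (not an actual implementation).
    # One partition map per data bucket: partition ID -> virtual processor ID.
    partition_maps = {
        "B1": {0: "VP1", 1: "VP1", 2: "VP2", 3: "VP2"},  # example entries only
    }

    # One VP-CN map common to all data buckets: virtual processor ID -> computer node ID.
    vp_cn_map = {"VP1": "CN1", "VP2": "CN2"}

    def locate_metadata_owner(bucket_id, partition_id):
        """Return the (virtual processor, computer node) that owns a metadata partition."""
        vp = partition_maps[bucket_id][partition_id]  # static across virtual processor migrations
        return vp, vp_cn_map[vp]

    def migrate_virtual_processor(vp, target_node):
        """Migrating a virtual processor updates only the VP-CN map."""
        vp_cn_map[vp] = target_node  # the partition maps are left unchanged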
The cluster 100 of computer nodes is able to manage the access of data stored in a shared storage system 102 in response to data access requests received over a network 112 from requester devices 104. As used here, a “requester device” can refer to any electronic device that is able to send a request to access data (read data or write data). Examples of electronic devices include any or some combination of the following: desktop computers, notebook computers, tablet computers, server computers, game appliances, Internet-of-Things (IoT) devices, vehicles, household appliances, and so forth. Examples of the network 112 can include any or some combination of the following: a storage area network (SAN), a local area network (LAN), a wide area network (WAN), and so forth.
The shared storage system 102 is accessible by any of the computer nodes CN1 to CN3 over a communication link 106 between the cluster 100 of computer nodes and the shared storage system 102. The shared storage system 102 is implemented using a collection of storage devices 108. As used here, a “collection” of items can refer to a single item or to multiple items. Thus, the collection of storage devices 108 can include a single storage device or multiple storage devices. Examples of storage devices can include any or some combination of the following: disk-based storage devices, solid state drives, and so forth.
The requester devices 104 can send data access requests to any of the computer nodes CN1 to CN3 over the network 112. Each computer node executes a collection of virtual processors (a single virtual processor or multiple virtual processors). A virtual processor is executed by a processing resource of a computer node.
In the example of
Virtual processors can also be migrated between computer nodes. For example, to achieve load balancing or for fault tolerance or recovery, a first virtual processor may be migrated from a current (source) computer node to a destination computer node. Once migrated, the first virtual processor executes at the destination computer node.
In some examples, data to be stored in the shared storage system 102 by virtual processors can be part of data buckets. A data bucket can refer to any type of container that includes a collection of data objects (a single data object or multiple data objects). A specific example of a data bucket is an S3 bucket in an Amazon cloud storage. In other examples, other types of data buckets can be employed.
A data object can be divided into a collection of data chunks (a single data chunk or multiple data chunks). Each data chunk (or more simply “chunk”) has a specified size (a static size or a size that can dynamically change). The storage locations of the chunks are storage locations in the shared storage system 102.
Each of the virtual processors VP1 to VP6 may maintain respective metadata associated with data objects stored or to be stored in the shared storage system 102. In some examples, the metadata can include data object metadata such as a list of chunk identifiers (IDs) that identify chunks of a data object, and a virtual processor ID that identifies a virtual processor. The list of chunk IDs can include a single chunk ID or multiple chunk IDs. The data object metadata can further include a version ID that represents a version of a data object. As a data object is modified by write request(s), corresponding different versions of the data object are created (and identified by respective version IDs). As used here, an “ID” can refer to any information (e.g., a name, a string, etc.) that can be used to distinguish one item from another item (e.g., distinguish between chunks, or distinguish between data objects, or distinguish between versions of a data object, or distinguish between virtual processors, and so forth).
The metadata may further include storage location metadata representing storage locations of chunks of data objects in the shared storage system 102. For example, the storage location metadata can include any or some combination of the following: an offset, a storage address, a block number, and so forth.
The metadata may also include commit metadata maintained by a metadata virtual processor during write operations and read operations. The commit metadata indicates whether a write of a subject data object is in progress (e.g., at a virtual processor that is assigned to handle the write) or a write of the subject data object is no longer in progress (i.e., a write of the subject data object is complete). Note that a write of the subject data object is complete if the subject data object has been written to either a write buffer (not shown) or the shared storage system 102. A write buffer can be part of an NV memory (e.g., any of 110-1 to 110-3) and is associated with a respective virtual processor for caching write data.
In other examples, additional or alternative metadata may be employed.
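As a non-limiting illustration, the categories of metadata described above may be represented by record structures such as the following Python sketch; the field names are assumptions chosen for readability rather than required names.

    from dataclasses import dataclass

    @dataclass
    class DataObjectMetadata:
        chunk_ids: list        # chunk IDs of the chunks that make up the data object
        vp_id: str             # ID of the virtual processor that owns the metadata
        version_id: str        # version of the data object

    @dataclass
    class StorageLocationMetadata:
        offset: int            # offset of a chunk in the shared storage system
        block_number: int      # block number of the chunk

    @dataclass
    class CommitMetadata:
        # True while a write of the subject data object is in progress; False once the
        # object has been written to a write buffer or to the shared storage system.
        write_in_progress: bool = False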
As noted above, metadata can be partitioned into multiple partitions that are spread across multiple virtual processors executing in the cluster 100 of computer nodes. Each virtual processor may include one or more partitions. A virtual processor “including” a partition can refer to the virtual processor owning a metadata portion in the partition.
As shown in
Even though the partitions are shown as being inside respective virtual processors in
Also, although each virtual processor is depicted as including a specific quantity of partitions, in other examples, a virtual processor can include a different quantity of partitions for any given data bucket.
Each virtual processor is responsible for managing respective metadata of one or more partitions. In an example, there are M (M≥2) partitions and P (P≥2) virtual processors. In such an example, each virtual processor of the P virtual processors is responsible for managing metadata of M/P partitions.
A partition is identified by a Partition ID. In some examples, a Partition ID is based on applying a hash function (e.g., a cryptographic hash function or another type of hash function) on information (key) associated with a data object. In some examples, the key associated with the data object on which the hash function is applied includes a Bucket ID identifying the data bucket of which the data object is part, and an Object ID that identifies the data object. The hash function applied on the key associated with the data object produces a hash value that is the Partition ID or from which the Partition ID is derived.
For example, a hash value (Hash-Value) is computed as follows:

Hash-Value=Hash(F(Request.key))   (Eq. 1)
where F(x) is a transformation applied on x (in this case a transformation applied on Request.key), Hash( ) is a hash function, and Request.key includes the Bucket ID and the Object ID of a data access request (a read request or write request). The transformation can be a concatenation operation of Bucket ID and Object ID, for example.
The Partition ID is computed based on the Hash-Value as follows:

Partition ID=Hash-Value % #Partitions   (Eq. 2)
where % is a modulus operation, and #Partitions is a constant (which can be a configured value) that represents how many partitions are to be divided across a cluster of computer nodes. The modulus operation computes the remainder after dividing Hash-Value by #Partitions.
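A minimal Python sketch of Eqs. 1 and 2 follows; the choice of SHA-256 as the hash function and of concatenation as the transformation F are assumptions, since any hash function and transformation may be used.

    import hashlib

    NUM_PARTITIONS = 12  # the configured constant #Partitions

    def compute_partition_id(bucket_id: str, object_id: str) -> int:
        # Eq. 1: F(Request.key) concatenates the Bucket ID and Object ID, and a hash
        # function (here, SHA-256) is applied to the result.
        key = bucket_id + "/" + object_id
        hash_value = int.from_bytes(hashlib.sha256(key.encode()).digest(), "big")
        # Eq. 2: the modulus operation reduces the hash value to one of #Partitions partitions.
        return hash_value % NUM_PARTITIONS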
As each data bucket is created, the metadata for each data bucket is sharded across multiple partitions. A given partition defines a subset of the object key-space for a specific data bucket and maps to one virtual processor. The object key-space for a data bucket refers to the possible range of values associated with the key for data objects in the data bucket, such as the range of hash values (Hash-Value) computed as discussed above. The entire object key-space can be uniformly distributed among the partitions, which can lead to equal-size partitions. A last partition may be larger or smaller if the object key-space is not exactly divisible by the number of partitions.
As shown in
In some examples, the quantity of partitions for a data bucket is a configurable constant (e.g., #Partitions in Eq. 2 above) across all buckets; in other words, each data bucket of multiple data buckets to be stored in the shared storage system 102 can have a static quantity of partitions. In the example of
For an even distribution of partitions across virtual processors, the configurable constant (e.g., #Partitions in Eq. 2) may be a highly composite number with a relatively large quantity of divisors. For example, the number 48 is divisible by the following divisors: 1, 2, 3, 4, 6, 8, 12, 16, 24, 48. When a data bucket is created, the data bucket is associated with the static quantity of partitions, which are spread out as uniformly as possible across all available virtual processors in the cluster 100 at the time the data bucket is created.
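One possible way (a hedged sketch, not the only option) to spread a static quantity of partitions as uniformly as possible across the virtual processors available at bucket-creation time is to assign contiguous runs of partitions, as in the following Python example.

    def create_partition_map(num_partitions, available_vps):
        """Assign partitions to virtual processors as evenly as possible, in contiguous runs."""
        count = len(available_vps)
        return {p: available_vps[p * count // num_partitions] for p in range(num_partitions)}

    # Data bucket B1 is created while four virtual processors exist (on computer nodes CN1 and CN2):
    b1_map = create_partition_map(12, ["VP1", "VP2", "VP3", "VP4"])
    # Each of the four virtual processors receives three of the 12 partitions.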
Data bucket B1 is created when the cluster 100 is configured with the computer nodes CN1 and CN2 at time T1. In some examples, the creation of data bucket B1 is triggered by a user, such as at a requester device 104. In other examples, another entity (a program or a machine) can trigger the creation of data bucket B1. Creation of a data bucket can be triggered in response to a request or any other type of input from an entity (user, program, or machine).
In some examples, the request to create data bucket B1 can be received at any of the computer nodes of the cluster 100. The computer node that received the request creates a partition map. Each computer node includes a map creation engine that creates the partition map. In the example of
As used here, an “engine” can refer to one or more hardware processing circuits, which can include any or some combination of a microprocessor, a core of a multi-core microprocessor, a microcontroller, a programmable integrated circuit, a programmable gate array, or another hardware processing circuit. Alternatively, an “engine” can refer to a combination of one or more hardware processing circuits and machine-readable instructions (software and/or firmware) executable on the one or more hardware processing circuits.
It is assumed that the computer node CN1 received the request to create data bucket B1. In this example, the map creation engine 202-1 responds to the request to create data bucket B1 by creating a B1 partition map 204 that maps partitions containing metadata for data objects of data bucket B1 to respective virtual processors. In the example of
The B1 partition map 204 includes a number of entries that is equal to #Partitions. If #Partitions is 12, then the B1 partition map 204 includes 12 entries. Each entry of the 12 entries corresponds to a respective partition of the 12 partitions. Thus, the first entry of the B1 partition map 204 corresponds to partition P11, the second entry of the B1 partition map 204 corresponds to partition P12, the third entry of the B1 partition map 204 corresponds to partition P13, the fourth entry of the B1 partition map 204 corresponds to partition P21, and so forth. Each entry of the B1 partition map 204 contains an identifier of a virtual processor that is associated with the corresponding partition. Thus, in the example of
Once created by the map creation engine 202-1, the B1 partition map 204 can be stored (persisted) in the shared storage system 102 by the computer node CN1 that processed the request to create data bucket B1. The other computer nodes, such as CN2, can read the B1 partition map 204 from the shared storage system 102 (such as while processing write and read requests) and cache a copy of the B1 partition map 204 in the NV memory 110-2.
Note that each computer node CN1 or CN2 also includes a VP-CN map 206 that is created by a map creation engine (same as or different from the map creation engine used to create a partition map). The VP-CN map 206 associates virtual processors to computer nodes that are present in the cluster 100 at time T1. The first entry of the VP-CN map 206 maps VP1 to CN1, the second entry of the VP-CN map 206 maps VP2 to CN1, the third entry of the VP-CN map 206 maps VP3 to CN2, and the fourth entry of the VP-CN map 206 maps VP4 to CN2.
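Written out as Python literals purely for illustration (the assignment of partitions to virtual processors is inferred from the partition naming in the example above), the B1 partition map 204 and the VP-CN map 206 would contain entries such as the following.

    # B1 partition map 204: 12 entries (#Partitions = 12), partition -> virtual processor.
    b1_partition_map_204 = {
        "P11": "VP1", "P12": "VP1", "P13": "VP1",
        "P21": "VP2", "P22": "VP2", "P23": "VP2",
        "P31": "VP3", "P32": "VP3", "P33": "VP3",
        "P41": "VP4", "P42": "VP4", "P43": "VP4",
    }

    # VP-CN map 206: virtual processor -> computer node, for the cluster at time T1.
    vp_cn_map_206 = {"VP1": "CN1", "VP2": "CN1", "VP3": "CN2", "VP4": "CN2"}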
If the configuration of the cluster 100 is changed (by adding or removing computer nodes), then the VP-CN map 206 can be modified to change the mapping of virtual processors and computer nodes.
In further examples, the map creation engine(s) to create partition maps and/or VP-CN maps can be in a computer system separate from the cluster 100 of computer nodes.
After the third computer node CN3 has been added, data bucket B2 is created. The metadata for the data objects of data bucket B2 is divided into 12 partitions (#Partitions=12) across the virtual processors VP1 to VP6 of the three computer nodes CN1, CN2, and CN3. As a result, two partitions are included in each virtual processor for data bucket B2.
In response to the creation of data bucket B2, a map creation engine (any of 202-1, 202-2, and 202-3 in the computer node that received the request to create data bucket B2) creates a B2 partition map 208. The 12 entries of the B2 partition map 208 correspond to the 12 partitions P1A, P1B, P2A, P2B, P3A, P3B, P4A, P4B, P5A, P5B, P6A, and P6B, respectively. The 12 entries of the B2 partition map 208 associate partitions to virtual processors as follows: the first entry maps P1A to VP1, the second entry maps P1B to VP1, the third entry maps P2A to VP2, the fourth entry maps P2B to VP2, the fifth entry maps P3A to VP3, the sixth entry maps P3B to VP3, the seventh entry maps P4A to VP4, the eighth entry maps P4B to VP4, the ninth entry maps P5A to VP5, the tenth entry maps P5B to VP5, the eleventh entry maps P6A to VP6, and the twelfth entry maps P6B to VP6.
Once created, the B2 partition map 208 can be written by the computer node at which the B2 partition map 208 was created to the shared storage system 102, and the other two computer nodes can read the B2 partition map 208 from the shared storage system 102 to cache respective copies of the B2 partition map 208 in the respective NV memories. Note that each of the computer nodes CN1 and CN2 includes the following maps: the B1 partition map 204, the VP-CN map 206, and the B2 partition map 208. The computer node CN3 includes the VP-CN map 206 and the B2 partition map 208, but does not include the B1 partition map 204.
In some examples, the partition maps (204 and 208), once assigned, can remain static as the cluster 100 of computer nodes is expanded by adding more computer nodes. The partition maps 204 and 208 and the VP-CN map 206 may be stored in a repository (e.g., in the shared storage system), with copies of the maps 204, 206, and 208 cached in the computer nodes CN1, CN2, and CN3, such as in the respective NV memories 110-1, 110-2, and 110-3.
The computer node that received the incoming write request is referred to as the “receiving computer node.” In examples where the receiving computer node executes multiple virtual processors, the receiving computer node selects one of the multiple virtual processors as a source virtual processor 302 to handle the incoming write request. The selection can be a random selection process in which the source virtual processor 302 is randomly selected from the multiple virtual processors in the receiving computer node. In another example, the selection process can be based on determining relative loads of the virtual processors in the receiving computer node, and selecting a virtual processor with the least load to be the source virtual processor 302.
In response to the write request, the receiving computer node determines which of the virtual processors VP1 to VP6 in the cluster 100 of computer nodes is a metadata virtual processor 304 for the subject data object of the incoming write request 120. The receiving computer node can apply a hash function or another type of function on a key associated with the subject data object, to produce a Partition ID that identifies the partition that the metadata for the subject data object is part of. The key includes a Bucket ID identifying the data bucket that the subject data object is part of, and an Object ID that identifies the subject data object. Once the Partition ID is obtained, the source virtual processor 302 accesses (at 306) a partition map for the data bucket identified by the Bucket ID to determine which virtual processor is mapped to the partition identified by the Partition ID. This virtual processor is the metadata virtual processor 304.
The source virtual processor 302 also accesses (at 308) the VP-CN map (e.g., 206) to identify which computer node the metadata virtual processor 304 executes in. The source virtual processor 302 sends (at 310) a control message to the metadata virtual processor 304 in the identified computer node. The control message contains the Bucket ID, Partition ID, Object ID, and Version ID of the subject data object. The control message indicates to the metadata virtual processor 304 that the source virtual processor 302 is ready to start the write operation for the incoming write request.
In response to the control message, the metadata virtual processor 304 generates (at 312) a list of chunk IDs that identifies one or more data chunks for the subject data object. The chunk ID(s) generated depend(s) on the size of the subject data object, which determines how many data chunks the subject data object is divided into. The metadata virtual processor 304 sends (at 314) the list of chunk IDs to the source virtual processor 302.
In response to receiving the list of chunk IDs from the metadata virtual processor 304, the source virtual processor 302 writes (at 316) the chunk(s) of the subject data object using the chunk ID(s) included in the list of chunk IDs. The source virtual processor 302 can write the data chunk(s) to a write buffer (not shown) associated with the source virtual processor 302 and/or to the shared storage system 102.
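The write path described above may be sketched in Python as follows. This is a simplified, single-process illustration: compute_partition_id is the function from the earlier sketch, the control message exchange with the metadata virtual processor is collapsed into a local call, and the chunk-size value and helper names are assumptions.

    def generate_chunk_ids(object_id, object_size, chunk_size=4096):
        # The number of chunk IDs depends on the size of the subject data object.
        num_chunks = max(1, -(-object_size // chunk_size))  # ceiling division
        return [f"{object_id}-chunk-{i}" for i in range(num_chunks)]

    def handle_write(bucket_id, object_id, version_id, data,
                     partition_maps, vp_cn_map, shared_storage, chunk_size=4096):
        """Sketch of the write flow at the source virtual processor."""
        pid = compute_partition_id(bucket_id, object_id)
        metadata_vp = partition_maps[bucket_id][pid]   # partition map lookup (306)
        metadata_node = vp_cn_map[metadata_vp]         # VP-CN map lookup (308)
        # A control message carrying the Bucket ID, Partition ID, Object ID, and Version ID
        # would be sent to metadata_vp on metadata_node (310); here the metadata virtual
        # processor's chunk ID generation (312, 314) is modeled as a local call.
        chunk_ids = generate_chunk_ids(object_id, len(data))
        # Write the chunk(s) using the returned chunk IDs (316).
        for i, chunk_id in enumerate(chunk_ids):
            shared_storage[chunk_id] = data[i * chunk_size:(i + 1) * chunk_size]
        return chunk_ids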
Note that rebalancing of partitions across computer nodes may occur if one or more computer nodes are removed from the cluster 100. The partitions of the removed one or more computer nodes can be added to virtual processors of the remaining computer nodes of the cluster 100.
Rebalancing of partitions across computer nodes of the cluster 100 may also occur in response to another condition, such as overloading or a fault of a virtual processor or a computer node. For example, if a given virtual processor is overloaded, then a partition can be moved from the given virtual processor to another virtual processor. Rebalancing one or more partitions from heavily loaded virtual processors to more lightly loaded virtual processors can reduce metadata hot-spots. A metadata hot-spot can occur at a given virtual processor if there is a large number of requests for metadata associated with subject data objects of write requests from other virtual processors to the given virtual processor.
If a virtual processor or a computer node is heavily loaded (due to performing a large quantity of operations as compared to virtual processors on other computer nodes), then one or more partitions of the heavily loaded virtual processor or computer node can be moved to one or more other virtual processors (which can be on the same or a different computer node). Moving a partition from a first virtual processor to a second virtual processor results in the first virtual processor no longer owning the metadata of the partition, and the second virtual processor owning the metadata of the partition.
As an example, in
The movement of a partition between different computer nodes can be accomplished with reduced overhead since data in data objects does not have to be moved between computer nodes as a result of the movement of the partition. For example, when the partition P11 is moved from the virtual processor VP1 to the virtual processor VP3, the data objects associated with the metadata in the moved partition do not have to be moved between the computer nodes CN1 and CN2. If the data objects associated with the metadata in the moved partition P11 are in the write buffer of the computer node CN1, the data objects can be flushed to the shared storage system 102 (but do not have to be copied to the computer node CN2). If the data objects associated with the metadata in the moved partition P11 are already in the shared storage system 102, then no movement of the data objects occurs in response to movement of the partition P11.
In some examples, if the data objects associated with the metadata in the moved partition P11 are arranged in a hierarchical structure, then a root or other higher-level element may be moved from the computer node CN1 to the computer node CN2. An example of such a higher-level element is a superblock, which contains storage locations of the data objects. However, the amount of information in the superblock is much less than the data contained in the data objects referred to by the superblock, so the movement of the superblock between computer nodes has a relatively low overhead in terms of resources used.
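A hypothetical sketch of moving a partition for rebalancing is shown below; it assumes that buffered writes are keyed by partition, which is an illustration rather than a requirement. The key point is that only the partition map entry changes and buffered data is flushed to the shared storage system, while data objects are not copied between computer nodes.

    def move_partition(bucket_id, pid, target_vp, partition_maps, write_buffer, shared_storage):
        """Rebalance one partition: update the partition map; do not move data objects."""
        # Flush buffered data objects associated with the moved metadata to the shared
        # storage system (they are not copied to the target computer node).
        for chunk_id, chunk in write_buffer.pop(pid, {}).items():
            shared_storage[chunk_id] = chunk
        # Re-point the partition at its new owning virtual processor.
        partition_maps[bucket_id][pid] = target_vp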
Note that rebalancing can also be accomplished by moving a virtual processor (and its included partitions) between different computer nodes. The migration of a given virtual processor from a source computer node to a target computer node can similarly be accomplished without moving data of associated data objects between the source and target computer nodes, although superblocks or other higher-level elements may be moved.
In some examples, prefix-based hashing is applied on a prefix of a key of a data access request to produce a hash value from which a Partition ID is derived. In such examples, Eq. 1 is modified to produce a hash value (Hash-Value) as follows:

Hash-Value=Hash(F(Prefix of Request.key))   (Eq. 3)
where Prefix of Request.key represents the prefix of the key.
The prefix has a specified prefix length. In some examples, different data buckets can be associated with respective prefix lengths, at least some of which may be different from one another.
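A Python sketch of Eq. 3 with per-bucket prefix lengths follows; the configuration dictionary and the use of SHA-256 are assumptions. Because the hash function is applied only to the prefix, all keys that share a prefix map to the same partition.

    import hashlib

    NUM_PARTITIONS = 12
    # Hypothetical configuration information: prefix length specified per data bucket.
    PREFIX_LENGTHS = {"B1": 10, "B2": 16}

    def prefix_partition_id(bucket_id: str, key: str) -> int:
        # Eq. 3: apply the hash function to the prefix of the key only.
        prefix = key[:PREFIX_LENGTHS[bucket_id]]
        hash_value = int.from_bytes(hashlib.sha256(prefix.encode()).digest(), "big")
        return hash_value % NUM_PARTITIONS

    # Keys that share the prefix "/myprefix/" (length 10) land in the same partition, so a
    # range query over that prefix can be served from a single partition.
    assert prefix_partition_id("B1", "/myprefix/foo") == prefix_partition_id("B1", "/myprefix/bar")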
For example,
More generally, each prefix length in the configuration information 400 is specified for a respective collection of data buckets (a single data bucket or multiple data buckets).
The ability to specify different prefix lengths for different data buckets (or more generally different collections of data buckets) enhances flexibility in how metadata for the different data buckets is sharded across virtual processors in the cluster 100 of computer nodes. For example, the different data buckets (or collections of data buckets) may be associated with different applications.
The computation of hash values from which Partition IDs are determined based on prefixes of keys (rather than the entirety of the keys) can allow for greater efficiency when performing range queries with respect to data objects. A “range query” is a query for a collection of data objects with keys within a specified range.
In an example, it is assumed that a given data bucket contains R (R≥2) data objects, where the key of each data object in the given data bucket starts with "/myprefix/", which is a combination of the Bucket ID and a portion (less than the entirety) of the Object ID of a data object in the given data bucket. In this example, data object 1 has key "/myprefix/foo", data object 2 has key "/myprefix/bar", . . . , and data object R has key "/myprefix/foobar".
Further, assume the prefix length for the given data bucket specified by the configuration information 400 is the length of the string "/myprefix/". As a result, when computing the hash value according to Eq. 3 based on the key of each respective data object in the given data bucket, the hash function is applied on the same string "/myprefix/" for every key. The same Partition ID will be generated for each data object of the given data bucket, so the metadata for all of the data objects in the given data bucket will end up in the same partition.
Subsequently, if a user or another entity submits a range query (such as from a requester device 104 of
The machine-readable instructions include partition map creation instructions 502 to create a partition map that maps partitions of a data bucket to respective virtual processors executed in a cluster of computer nodes that are coupled to a shared storage system to store data of the data bucket. The portions of metadata for the data bucket are divided across the partitions. The partition map creation instructions 502 can include instructions of any or some combination of the map creation engines 202-1 to 202-3 of
The machine-readable instructions include partition identification instructions 504 to, responsive to a request to access a data object in the data bucket, identify which partition of the partitions contains metadata for the data object based on a key associated with the data object. In some examples, the identification of the partition is based on applying a function (e.g., a hash function such as in Eq. 1 or 3) on a key associated with the request to access the data object.
The function can be applied on a first portion of the key associated with the data object to obtain a value, which is used to identify the partition that contains the metadata for the data object. In some examples, the first portion of the key is a prefix of the key, where the prefix of the key is less than an entirety of the key.
The machine-readable instructions further include virtual processor identification instructions 506 to identify, based on the identified partition and using the partition map, a virtual processor that has the metadata for the data object. The partition map associates the identified partition with the virtual processor.
The machine-readable instructions further include VP-CN map update instructions 508 to, responsive to a migration of a first virtual processor from a first computer node to a second computer node of the cluster of computer nodes, update a VP-CN map that maps the respective virtual processors to corresponding computer nodes of the cluster of computer nodes. Although the VP-CN map is updated, the partition map remains unchanged in response to the migration of the first virtual processor from the first computer node to the second computer node.
In some examples, the migration of the first virtual processor from the first computer node to the second computer node causes migration of one or more portions of the metadata associated with the first virtual processor from the first computer node to the second computer node. In some examples, the migration of the first virtual processor from the first computer node to the second computer node is performed without performing movement of data owned by the first virtual processor between computer nodes of the cluster of computer nodes.
In some examples, prior to the migration of the first virtual processor, a first request for a first data object associated with a first key maps, based on the partition map, to the first virtual processor on the first computer node. After the migration of the first virtual processor, a second request for the first data object associated with the first key maps, based on the partition map, to the first virtual processor on the second computer node.
In some examples, in response to the first request, the machine-readable instructions obtain metadata for the first data object from the first virtual processor on the first computer node, and in response to the second request, the machine-readable instructions obtain the metadata for the first data object from the first virtual processor on the second computer node.
In some examples, different keys associated with respective data objects that share a same prefix map to a same partition, and the machine-readable instructions perform a range query for a range of keys at one or more virtual processors mapped to a partition associated with the range of keys.
In some examples, in response to detecting a metadata hotspot at a given virtual processor, the machine-readable instructions migrate a first partition from the given virtual processor to another virtual processor, and update the partition map in response to the migration of the first partition.
The system 600 includes a shared storage system 610 accessible by the cluster of computer nodes 602 to store data of the first and second data buckets.
A first virtual processor VP1 in the cluster of computer nodes 602 responds to a request 612 to access a first data object in the first data bucket by identifying a first given partition of the partitions of the first data bucket that contains metadata for the first data object based on a first portion of a first key 614 associated with the first data object. The first virtual processor VP1 identifies, based on the first given partition and using the first partition map 606, a virtual processor that has the metadata for the first data object.
A second virtual processor VP2 in the cluster of computer nodes 602 responds to a request 616 to access a second data object in the second data bucket by identifying a second given partition of the partitions of the second data bucket that contains metadata for the second data object based on a second portion of a second key 618 associated with the second data object. The second portion of the second key 618 is of a different length than the first portion of the first key 614. The second virtual processor VP2 identifies, based on the second given partition and using the second partition map 608, a virtual processor that has the metadata for the second data object.
In some examples, the identifying of the first given partition of the partitions of the first data bucket that contains metadata for the first data object is based on applying a hash function on the first portion of the first key associated with the first data object, and the identifying of the second given partition of the partitions of the second data bucket that contains metadata for the second data object is based on applying the hash function on the second portion of the second key associated with the second data object.
The process 700 includes receiving (at 702) a request to create a data bucket. The process 700 includes creating (at 704), in response to the request to create the data bucket, a partition map that maps partitions of the data bucket to respective virtual processors executed in a cluster of computer nodes that are coupled to a shared storage system to store data of the data bucket, where portions of metadata for the data bucket are divided across the partitions.
The process 700 includes receiving (at 706) a request to access a data object in the data bucket. In response to the request to access the data object in the data bucket, the process 700 identifies (at 708) which partition of the partitions contains metadata for the data object based on applying a function on a key associated with the data object, and identifies (at 710), based on the identified partition and using the partition map, a virtual processor that has the metadata for the data object.
Responsive to a migration of a first virtual processor of the virtual processors from a first computer node to a second computer node of the cluster of computer nodes, the process 700 updates (at 712) a virtual processor-computer node map that maps the respective virtual processors to corresponding computer nodes of the cluster of computer nodes, where the partition map remains unchanged in response to the migration of the first virtual processor from the first computer node to the second computer node.
A storage medium (e.g., 500 in
In the present disclosure, use of the term "a," "an," or "the" is intended to include the plural forms as well, unless the context clearly indicates otherwise. Also, the terms "includes," "including," "comprises," "comprising," "have," or "having," when used in this disclosure, specify the presence of the stated elements but do not preclude the presence or addition of other elements.
In the foregoing description, numerous details are set forth to provide an understanding of the subject disclosed herein. However, implementations may be practiced without some of these details. Other implementations may include modifications and variations from the details discussed above. It is intended that the appended claims cover such modifications and variations.