This disclosure relates to the storage field, and in particular, to a node capacity expansion method in a storage system and a storage system.
In a distributed storage system, the capacity of the storage system needs to be expanded if the free space of the storage system is insufficient. When a new node is added to the storage system, an original node migrates some partitions and data corresponding to the partitions to the new node. Such data migration between storage nodes inevitably consumes bandwidth.
This disclosure provides a node capacity expansion method in a storage system and a storage system, to save bandwidth between storage nodes.
According to a first aspect, a node capacity expansion method in a storage system is provided. The storage system includes one or more first nodes. Each first node stores data and metadata of the data. According to the method, a data partition group and a metadata partition group are configured for the first node, where the data partition group includes a plurality of data partitions, the metadata partition group includes a plurality of metadata partitions, and metadata of data corresponding to the data partition group is a subset of metadata corresponding to the metadata partition group. A meaning of the subset is that a quantity of the data partitions included in the data partition group is less than a quantity of the metadata partitions included in the metadata partition group, metadata corresponding to one part of the metadata partitions included in the metadata partition group is used to describe the data corresponding to the data partition group, and metadata corresponding to another part of the metadata partitions is used to describe data corresponding to another data partition group. When a second node is added to the storage system, the first node splits the metadata partition group into at least two metadata partition subgroups, and migrates a first metadata partition subgroup in the at least two metadata partition subgroups and metadata corresponding to the first metadata partition subgroup to the second node.
According to the method provided in the first aspect, when the second node is added, a metadata partition subgroup obtained after splitting by the first node and metadata corresponding to the metadata partition subgroup are migrated to the second node. Because a data volume of the metadata is far less than a data volume of the data, compared with migrating the data to the second node in other approaches, this method saves bandwidth between nodes.
In addition, the data partition group and the metadata partition group of the first node are configured such that the metadata of the data corresponding to the configured data partition group is a subset of the metadata corresponding to the metadata partition group. In this case, even if the metadata partition group is split into at least two metadata partition subgroups after capacity expansion, it can still be ensured to some extent that the metadata of the data corresponding to the data partition group is a subset of metadata corresponding to any metadata partition subgroup. After one of the metadata partition subgroups and metadata corresponding to the metadata partition subgroup are migrated to the second node, the data corresponding to the data partition group is still described by metadata stored on a same node. This avoids modifying metadata on different nodes when data is modified, especially when garbage collection is performed.
With reference to a first implementation of the first aspect, in a second implementation, the first node obtains a metadata partition group layout after capacity expansion and a metadata partition group layout before capacity expansion. The metadata partition group layout after capacity expansion includes a quantity of the metadata partition subgroups configured for each node in the storage system after the second node is added to the storage system, and a quantity of metadata partitions included in the metadata partition subgroup after the second node is added to the storage system. The metadata partition group layout before capacity expansion includes a quantity of the metadata partition groups configured for the first node before the second node is added to the storage system, and a quantity of metadata partitions included in the metadata partition groups before the second node is added to the storage system. The first node splits the metadata partition group into at least two metadata partition subgroups based on the metadata partition group layout after capacity expansion and the metadata partition group layout before capacity expansion.
With reference to any one of the foregoing implementations of the first aspect, in a third implementation, after the migration, the first node splits the data partition group into at least two data partition subgroups. Metadata of data corresponding to the data partition subgroup is a subset of metadata corresponding to the metadata partition subgroups. Splitting the data partition group into the data partition subgroups of a smaller granularity is to prepare for a next capacity expansion, so that the metadata of the data corresponding to the data partition subgroup is always the subset of the metadata corresponding to the metadata partition subgroups.
With reference to any one of the foregoing implementations of the first aspect, in a fourth implementation, when the second node is added to the storage system, the first node keeps the data partition group and the data corresponding to the data partition group stored on the first node. Because only the metadata is migrated and the data is not, and the data volume of the metadata is usually far less than that of the data, bandwidth between nodes is saved.
With reference to the first implementation of the first aspect, in a fifth implementation, the metadata of the data corresponding to the data partition group is a subset of the metadata corresponding to any one of the at least two metadata partition subgroups. In this way, it is ensured that the data corresponding to the data partition group is still described by metadata stored on a same node. This avoids modifying metadata on different nodes when data is modified, especially when garbage collection is performed.
According to a second aspect, a node capacity expansion apparatus is provided. The node capacity expansion apparatus is adapted to implement the method provided in any one of the first aspect and the implementations of the first aspect.
According to a third aspect, a storage node is provided. The storage node is adapted to implement the method provided in any one of the first aspect and the implementations of the first aspect.
According to a fourth aspect, a computer program product for a node capacity expansion method is provided. The computer program product includes a computer-readable storage medium that stores program code, and an instruction included in the program code is used to perform the method described in any one of the first aspect and the implementations of the first aspect.
According to a fifth aspect, a storage system is provided. The storage system includes at least a first node and a third node. In addition, in the storage system, data and metadata that describes the data are separately stored on different nodes. For example, the data is stored on the first node, and the metadata of the data is stored on the third node. The first node is adapted to configure a data partition group, and the data partition group corresponds to the data. The third node is adapted to configure a metadata partition group, and metadata of data corresponding to the configured data partition group is a subset of metadata corresponding to the configured metadata partition group. When a second node is added to the storage system, the third node splits the metadata partition group into at least two metadata partition subgroups, and migrates a first metadata partition subgroup in the at least two metadata partition subgroups and metadata corresponding to the first metadata partition subgroup to the second node.
In the storage system provided in the fifth aspect, although the data and the metadata of the data are stored on different nodes, because the data partition group and the metadata partition group of the nodes are configured in a same way as in the first aspect, metadata of data corresponding to any data partition group can still be stored on one node after the migration, and there is no need to obtain or modify the metadata on two nodes.
According to a sixth aspect, a node capacity expansion method is provided. The node capacity expansion method is applied to the storage system provided in the fifth aspect, and the first node in the storage system performs a function provided in the fifth aspect.
According to a seventh aspect, a node capacity expansion apparatus is provided. The node capacity expansion apparatus is located in the storage system provided in the fifth aspect, and is adapted to perform the function provided in the fifth aspect.
According to an eighth aspect, a node capacity expansion method in a storage system is provided. The storage system includes one or more first nodes. Each first node stores data and metadata of the data. In addition, the first node includes at least two metadata partition groups and at least two data partition groups, and metadata corresponding to each metadata partition group is separately used to describe data corresponding to one of the data partition groups. The metadata partition groups and the data partition groups are configured for the first node, so that a quantity of metadata partitions included in the metadata partition groups is equal to a quantity of data partitions included in the data partition group. When a second node is added to the storage system, the first node migrates a first metadata partition group in the at least two metadata partition groups and metadata corresponding to the first metadata partition group to the second node. However, data corresponding to the at least two data partition groups is still stored on the first node.
In the storage system provided in the eighth aspect, after the migration, metadata of data corresponding to any data partition group is stored on one node, and there is no need to obtain or modify the metadata on two nodes.
According to a ninth aspect, a node capacity expansion method is provided. The node capacity expansion method is applied to the storage system provided in the eighth aspect, and the first node in the storage system performs a function provided in the eighth aspect.
According to a tenth aspect, a node capacity expansion apparatus is provided. The node capacity expansion apparatus is located in the storage system provided in the eighth aspect, and is adapted to perform the function provided in the eighth aspect.
In an embodiment of this disclosure, metadata is migrated to a new node during capacity expansion, and data is still stored on an original node. In addition, through configuration, it is always ensured that metadata of data corresponding to a data partition group is a subset of metadata corresponding to a metadata partition group, so that data corresponding to one data partition group is described only by metadata stored on one node. This saves bandwidth. The following describes technical solutions in this disclosure with reference to accompanying drawings.
The technical solutions in the embodiments of this disclosure may be applied to various storage systems. The following describes the technical solutions in the embodiments of this disclosure by using a distributed storage system as an example, but this is not limited in the embodiments of this disclosure. In the distributed storage system, data is separately stored on a plurality of storage nodes, and the plurality of storage nodes share a storage load. This storage mode improves reliability, availability, and access efficiency of the system, and the system is easy to expand. A storage node is, for example, a storage server, or a combination of a storage controller and a storage medium.
1. Data Storage Process:
To ensure that the data is evenly stored on each storage node 104, a distributed hash table (DHT) mode is usually used for routing when a storage node is selected. However, this is not limited in this embodiment of this disclosure. To be specific, in the technical solutions in the embodiments of this disclosure, various possible routing modes in the storage system may be used. According to a distributed hash table mode, a hash ring is evenly divided into several parts, each part is referred to as a partition, and each partition corresponds to a storage space of a specified size. It may be understood that a larger quantity of partitions indicates a smaller storage space corresponding to each partition, and a smaller quantity of partitions indicates a larger storage space corresponding to each partition. In an actual application, the quantity of partitions is usually relatively large (4096 partitions are used as an example in this embodiment). For ease of management, these partitions are divided into a plurality of partition groups, and each partition group includes the same quantity of partitions. If absolutely equal division cannot be achieved, the quantities of partitions in the partition groups are kept as close to each other as possible. For example, 4096 partitions are divided into 144 partition groups, where a partition group 0 includes partitions 0 to 27, a partition group 1 includes partitions 28 to 57, . . . , and a partition group 143 includes partitions 4066 to 4095. A partition group has its own identifier, and the identifier is used to uniquely identify the partition group. Similarly, a partition also has its own identifier, and the identifier is used to uniquely identify the partition. An identifier may be a number, a character string, or a combination of a number and a character string. In this embodiment, each partition group corresponds to one storage node 104, and "correspond" means that all data that is mapped, by using a hash value, to partitions of a same partition group is stored on a same storage node 104.
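The routing just described can be illustrated with a short sketch. The code below is a minimal illustration, not the claimed implementation: it hashes a virtual address onto a 4096-partition ring, derives a partition group from the partition identifier (contiguous 32-partition groups are assumed here for simplicity, whereas the embodiment uses 144 groups of 32 or 16 partitions), and looks the group up in a toy group-to-node table. The function names, the SHA-256 hash, and the sample address are illustrative assumptions.

```python
# Minimal sketch of DHT-style routing: virtual address -> partition -> partition group -> node.
import hashlib

TOTAL_PARTITIONS = 4096          # total quantity of partitions (example from the text)
PARTITIONS_PER_GROUP = 32        # hypothetical group size; the embodiment uses 32 or 16

def partition_of(virtual_address: str) -> int:
    """Hash the virtual address and map the hash value onto the partition ring."""
    digest = hashlib.sha256(virtual_address.encode()).digest()
    return int.from_bytes(digest[:8], "big") % TOTAL_PARTITIONS

def partition_group_of(partition_id: int) -> int:
    """Partitions are grouped contiguously; the group identifier is derived from the partition."""
    return partition_id // PARTITIONS_PER_GROUP

# group_to_node would be maintained by the storage system; shown here as a toy mapping.
group_to_node = {g: g % 3 for g in range(TOTAL_PARTITIONS // PARTITIONS_PER_GROUP)}

addr = "LUN-7:0x2000"            # hypothetical virtual address (LU identifier + offset)
p = partition_of(addr)
print(p, partition_group_of(p), group_to_node[partition_group_of(p)])
```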
The client server 101 sends a write request to any storage node 104, where the write request carries to-be-written data and a virtual address of the data. The virtual address includes an identifier and an offset of a logical unit (LU) into which the data is to be written, and the virtual address is an address visible to the client server 101. The storage node 104 that receives the write request performs a hash operation based on the virtual address of the data to obtain a hash value, and a target partition may be uniquely determined by using the hash value. After the target partition is determined, a partition group in which the target partition is located is also determined. According to a correspondence between a partition group and a storage node, the storage node that receives the write request may forward the write request to a storage node corresponding to the partition group. One partition group corresponds to one or more storage nodes. The corresponding storage node (referred to as a first storage node herein for distinguishing from another storage node 104) writes the write request into a cache of the corresponding storage node, and performs persistent storage when a condition is met.
In this embodiment, each storage node includes at least one storage unit. The storage unit is a logical space, and the actual physical space is still provided by a plurality of storage nodes, as shown in the accompanying figure.
When data in the cache of the first storage node reaches a specified threshold, the data may be sliced into a plurality of data slices based on the specified RAID type, and check slices are obtained through calculation. The data slices and the check slices are stored on the storage unit. The data slices and corresponding check slices form a stripe. One storage unit may store a plurality of stripes, and is not limited to the three stripes shown in the figure.
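The slicing of cached data into data slices plus check slices can be sketched as follows. This is a simplified, hypothetical example using a single XOR parity slice (RAID-5-like); the embodiment may use other RAID or erasure-coding schemes, such as four data slices and two check slices, and the function name and slice count here are illustrative.

```python
# Simplified sketch: split a cached buffer into data slices and one XOR check slice.
from typing import List

def make_stripe(buf: bytes, data_slices: int = 4) -> List[bytes]:
    """Split `buf` into `data_slices` equal slices and append one XOR parity slice."""
    slice_len = -(-len(buf) // data_slices)             # ceiling division
    padded = buf.ljust(slice_len * data_slices, b"\0")  # pad so all slices are equal length
    slices = [padded[i * slice_len:(i + 1) * slice_len] for i in range(data_slices)]
    parity = bytearray(slice_len)
    for s in slices:
        for i, b in enumerate(s):
            parity[i] ^= b
    return slices + [bytes(parity)]

stripe = make_stripe(b"some cached write data", data_slices=4)
# Each element of `stripe` would be written to a different storage node.
```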
2. Metadata Storage Process:
After data is stored on a storage node, to find the data at a later time, description information of the data further needs to be stored. The description information describing the data is referred to as metadata. When receiving a read request, the storage node usually finds metadata of to-be-read data based on a virtual address carried in the read request, and further obtains the to-be-read data based on the metadata. The metadata includes but is not limited to a correspondence between a logical address and a physical address of each slice, and a correspondence between a virtual address of the data and a logical address of each slice included in the data. A set of logical addresses of all slices included in the data is a logical address of the data.
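A hypothetical shape for such a metadata entry is sketched below. The class and field names are assumptions made for illustration; they merely mirror the correspondences listed above (a virtual address mapped to the logical addresses of its slices, and each slice mapped to its physical location).

```python
# Illustrative metadata entry: which slices make up the data at a virtual address,
# and where each slice physically lives. Field names are assumptions, not the real format.
from dataclasses import dataclass
from typing import List

@dataclass
class SliceLocation:
    logical_address: int     # logical address of the slice within its storage unit
    node_id: int             # node that physically stores the slice
    physical_address: int    # physical address of the slice on that node

@dataclass
class MetadataEntry:
    virtual_address: str             # address visible to the client server
    slices: List[SliceLocation]      # logical-to-physical mapping for every slice

entry = MetadataEntry("LUN-7:0x2000",
                      [SliceLocation(0x10, node_id=1, physical_address=0xA000),
                       SliceLocation(0x11, node_id=2, physical_address=0xB000)])
```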
Similar to the data storage process, a partition in which the metadata is located is also determined based on a virtual address carried in a read request or a write request. Further, a hash operation is performed on the virtual address to obtain a hash value, and a target partition may be uniquely determined by using the hash value. Therefore, a target partition group in which the target partition is located is further determined, and then to-be-stored metadata is sent to a storage node (for example, a first storage node) corresponding to the target partition group. When the to-be-stored metadata in the first storage node reaches a specified threshold (for example, 32 KB), the metadata is sliced into four data slices, and then two check slices are obtained through calculation. Then, these slices are sent to a plurality of storage nodes.
In this embodiment, a partition of the data and a partition of the metadata are independent of each other. In other words, the data has its own partition mechanism, and the metadata also has its own partition mechanism. However, a total quantity of partitions of the data is the same as a total quantity of partitions of the metadata. For example, the total quantity of the partitions of the data is 4096, and the total quantity of the partitions of the metadata is also 4096. For ease of description, in this embodiment of the present disclosure, a partition corresponding to the data is referred to as a data partition, and a partition corresponding to the metadata is referred to as a metadata partition. A partition group corresponding to the data is referred to as a data partition group, and a partition group corresponding to the metadata is referred to as a metadata partition group. Because both the metadata partition and the data partition are determined based on the virtual address carried in the read request or the write request, metadata corresponding to one metadata partition is used to describe data corresponding to a data partition that has a same identifier as the metadata partition. For example, metadata corresponding to a metadata partition 1 is used to describe data corresponding to a data partition 1, metadata corresponding to a metadata partition 2 is used to describe data corresponding to a data partition 2, and metadata corresponding to a metadata partition N is used to describe data corresponding to a data partition N, where N is an integer greater than or equal to 2. Data and metadata of the data may be stored on a same storage node, or may be stored on different storage nodes.
After the metadata is stored, when receiving a read request, the storage node may learn a physical address of the to-be-read data by reading the metadata. Further, when any storage node 104 receives a read request sent by the client server 101, the node 104 performs hash calculation on a virtual address carried in the read request to obtain a hash value, to obtain a metadata partition corresponding to the hash value and a metadata partition group of the metadata partition. Assuming that a storage unit corresponding to the metadata partition group belongs to the first storage node, the storage node 104 that receives the read request forwards the read request to the first storage node. The first storage node reads metadata of the to-be-read data from the storage unit. The first storage node then obtains, from a plurality of storage nodes based on the metadata, slices forming the to-be-read data, aggregates the slices into the to-be-read data after verifying that the slices are correct, and returns the to-be-read data to the client server 101.
3. Capacity Expansion:
As more data is stored in the storage system 100, the free storage space of the storage system 100 gradually decreases. Therefore, a quantity of the storage nodes in the storage system 100 needs to be increased. This process is referred to as capacity expansion. After a new storage node (new node) is added to the storage system 100, the storage system 100 migrates partitions of the old storage nodes (old nodes) and data corresponding to the partitions to the new node. For example, assuming that the storage system 100 originally has eight storage nodes, and has 16 storage nodes after capacity expansion, half of the partitions and the data corresponding to the partitions in the original eight storage nodes need to be migrated to the eight new storage nodes. To save bandwidth resources between the storage nodes, in this embodiment only metadata partitions and metadata corresponding to the metadata partitions are migrated, and data partitions are not migrated. After the metadata is migrated to the new storage node, because the metadata records a correspondence between a logical address and a physical address of the data, even if the client server 101 sends a read request to the new node, a location of the data on an original node may be found according to the correspondence to read the data. For example, if the metadata corresponding to the metadata partition 1 is migrated to the new node, when the client server 101 sends a read request to the new node to request to read the data corresponding to the data partition 1, although the data corresponding to the data partition 1 is not migrated to the new node, a physical address of the to-be-read data may still be found based on the metadata corresponding to the metadata partition 1, to read the data from the original node.
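The read path after such a migration can be sketched as follows. This is an assumed, simplified model, not the claimed implementation: the new node looks up its local metadata, which records the original node and physical address of the data, and then fetches the data from that node. The function names and the dictionary-based stand-ins for the cluster are illustrative.

```python
# Sketch of why reads still work after only the metadata is migrated to the new node.
def read_on_new_node(virtual_address, local_metadata, fetch_from_node):
    """local_metadata: dict virtual_address -> (original_node_id, physical_address)."""
    original_node, physical_address = local_metadata[virtual_address]
    # The data itself was never moved; fetch it from the original node.
    return fetch_from_node(original_node, physical_address)

# Example wiring with toy stand-ins for the cluster:
metadata_on_new_node = {"LUN-7:0x2000": (0, 0xA000)}      # data still resides on node 0
data_on_node_0 = {0xA000: b"hello"}
print(read_on_new_node("LUN-7:0x2000", metadata_on_new_node,
                       lambda node, addr: data_on_node_0[addr]))
```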
In addition, partitions and the data corresponding to the partitions are migrated by partition group during node capacity expansion. If the metadata corresponding to one metadata partition group covers less than the metadata used to describe the data corresponding to one data partition group, a same storage unit is referenced by at least two metadata partition groups. This makes management inconvenient.
In a typical configuration, the quantity of partitions included in a metadata partition group is less than the quantity of partitions included in a data partition group, as illustrated in the accompanying figure, so the foregoing problem arises.
To resolve the foregoing problem, in this embodiment, the quantity of the partitions included in the metadata partition group is set to be greater than or equal to the quantity of the partitions included in the data partition group. In other words, the metadata corresponding to one metadata partition group covers at least the metadata used to describe the data corresponding to one data partition group. For example, each metadata partition group includes 64 partitions, and each data partition group includes 32 partitions. As shown in the accompanying figure, the metadata corresponding to one such metadata partition group then describes the data corresponding to two data partition groups, and the metadata that describes the data corresponding to any one data partition group is stored on a single node.
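The configuration rule in the example above can be checked with a small sketch. The sketch assumes that metadata partitions and data partitions share identifiers and are grouped contiguously (64 per metadata partition group, 32 per data partition group), so every partition of one data partition group falls into a single metadata partition group; the grouping-by-division scheme itself is an illustrative assumption.

```python
# Sketch: with 64-partition metadata groups and 32-partition data groups, the metadata
# describing any one data partition group is covered by exactly one metadata partition group.
METADATA_GROUP_SIZE = 64
DATA_GROUP_SIZE = 32
TOTAL_PARTITIONS = 4096

def metadata_group_of(partition_id: int) -> int:
    return partition_id // METADATA_GROUP_SIZE

def data_group_of(partition_id: int) -> int:
    return partition_id // DATA_GROUP_SIZE

# Every partition of data partition group 3 maps into a single metadata partition group.
partitions_of_data_group_3 = [p for p in range(TOTAL_PARTITIONS) if data_group_of(p) == 3]
assert len({metadata_group_of(p) for p in partitions_of_data_group_3}) == 1
```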
Therefore, in this embodiment, before capacity expansion, the metadata partition group and the data partition group are configured so that the quantity of the partitions included in the metadata partition group is greater than the quantity of the partitions included in the data partition group. After capacity expansion, the metadata partition group on the original node is split into at least two metadata partition subgroups, and then at least one metadata partition subgroup and the metadata corresponding to the at least one metadata partition subgroup are migrated to the new node. Then, the data partition group on the original node is split into at least two data partition subgroups, so that the quantity of partitions included in each metadata partition subgroup remains greater than or equal to the quantity of partitions included in each data partition subgroup, to prepare for the next capacity expansion.
The following uses a specific example to describe the process of capacity expansion with reference to the accompanying figures.
In this embodiment, a quantity of partition groups allocated to each storage node may be preset. When the storage node includes a plurality of processing units, to evenly distribute read and write requests on the processing units, in this embodiment of the present disclosure, each processing unit may be set to correspond to a specific quantity of partition groups, where the processing unit is a central processing unit (CPU) on the node, as shown in Table 1:
Table 1 describes a relationship between the nodes and the processing units of the nodes, and a relationship between the nodes and the partition groups. For example, if each node has eight processing units, and six partition groups are allocated to each processing unit, a quantity of partition groups allocated to each node is 48. Assuming that the storage system 100 has three storage nodes before capacity expansion, a quantity of partition groups in the storage system 100 is 144. According to the foregoing description, a total quantity of partitions is configured when the storage system 100 is initialized. For example, the total quantity of partitions is 4096. To evenly distribute the 4096 partitions in the 144 partition groups, each partition group needs to have 4096/144≈28.44 partitions. However, the quantity of partitions included in each partition group needs to be an integer and 2 to the power of N, where N is an integer greater than or equal to 0. Therefore, the 4096 partitions cannot be absolutely evenly distributed in the 144 partition groups. It may be determined that 28.44 is less than 32 (2 to the power of 5) and greater than 16 (2 to the power of 4). Therefore, X first partition groups in the 144 partition groups each include 32 partitions, and Y second partition groups each include 16 partitions. X and Y meet the following equations: 32X+16Y=4096, and X+Y=144.
X=112 and Y=32 are obtained through calculation by using the foregoing two equations. This means that there are 112 first partition groups and 32 second partition groups in the 144 partition groups, where each first partition group includes 32 partitions and each second partition group includes 16 partitions. Then, a quantity of the first partition groups configured for each processing unit is calculated based on the total quantity of the first partition groups and the total quantity of the processing units (112/(3×8) = 4 with a remainder of 16), and a quantity of the second partition groups configured for each processing unit is calculated based on the total quantity of the second partition groups and the total quantity of the processing units (32/(3×8) = 1 with a remainder of 8). Therefore, it can be learned that at least four first partition groups and one second partition group are configured for each processing unit, and the remaining 16 first partition groups and eight second partition groups are distributed on the three nodes as evenly as possible (as shown in the accompanying partition layout figure).
Referring to the partition layout after capacity expansion, assume that two nodes are added to the storage system, so that the storage system includes five nodes after capacity expansion. The quantity of partition groups in the storage system becomes 5×8×6=240, and the total quantity of partitions is still 4096. Likewise, X first partition groups in the 240 partition groups each include 32 partitions, and Y second partition groups each include 16 partitions, where X and Y meet the following equations: 32X+16Y=4096, and X+Y=240.
X=16 and Y=224 are obtained through calculation by using the foregoing two equations. This means that there are 16 first partition groups and 224 second partition groups in the 240 partition groups, where each first partition group includes 32 partitions and each second partition group includes 16 partitions. Then, a quantity of the first partition groups configured for each processing unit is calculated based on the total quantity of the first partition groups and the total quantity of the processing units (16/(5×8) = 0 with a remainder of 16), and a quantity of the second partition groups configured for each processing unit is calculated based on the total quantity of the second partition groups and the total quantity of the processing units (224/(5×8) = 5 with a remainder of 24). Therefore, it can be learned that one first partition group is configured for each of only 16 processing units, at least five second partition groups are configured for each processing unit, and the remaining 24 second partition groups are distributed on the five nodes as evenly as possible (as shown in the accompanying partition layout figure).
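The arithmetic in the two preceding paragraphs can be reproduced with a short sketch. The function below solves 32X + 16Y = total partitions and X + Y = total partition groups, then reports the per-processing-unit quotient and remainder; the assumption that exactly two group sizes (32 and 16) are used follows the example above, and the function and key names are illustrative.

```python
# Sketch of the partition group layout calculation used in the example above.
def partition_group_layout(total_partitions: int, nodes: int,
                           units_per_node: int = 8, groups_per_unit: int = 6):
    total_groups = nodes * units_per_node * groups_per_unit
    # Solve 32*X + 16*Y = total_partitions and X + Y = total_groups (group sizes 32 and 16 assumed).
    first_groups = (total_partitions - 16 * total_groups) // 16   # X
    second_groups = total_groups - first_groups                   # Y
    units = nodes * units_per_node
    return {
        "total_groups": total_groups,
        "first_groups_32_partitions": first_groups,
        "second_groups_16_partitions": second_groups,
        "first_per_unit": divmod(first_groups, units),    # (per unit, remainder)
        "second_per_unit": divmod(second_groups, units),
    }

print(partition_group_layout(4096, nodes=3))   # X=112, Y=32  (before capacity expansion)
print(partition_group_layout(4096, nodes=5))   # X=16,  Y=224 (after capacity expansion)
```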
According to the schematic diagram of the partition layout of the three nodes before capacity expansion and the schematic diagram of the partition layout of the five nodes after capacity expansion, some of the first partition groups on the three nodes before capacity expansion may each be split into two second partition groups, and then, according to the distribution of partitions of each node after capacity expansion, some of the resulting second partition groups are migrated to the two new nodes.
In the foregoing example, the three storage nodes before capacity expansion first split some of the first partition groups into second partition groups and then migrate the second partition groups to the new nodes. In another implementation, the three storage nodes may first migrate some of the first partition groups to the new nodes and then split the first partition groups. In either way, the partition distribution of each node after capacity expansion can be obtained.
It should be noted that the foregoing description and the example merely illustrate how partition groups are split and migrated during capacity expansion, and do not constitute a limitation on this embodiment.
4. Garbage Collection:
When there is a relatively large amount of garbage data in the storage system 100, garbage collection may be started. In this embodiment, garbage collection is performed on a per-storage-unit basis. One storage unit is selected as an object for garbage collection, valid data on the storage unit is migrated to a new storage unit, and then the storage space occupied by the original storage unit is released. The selected storage unit needs to meet a specific condition, for example, the amount of garbage data on the storage unit reaches a first specified threshold, the storage unit contains the largest amount of garbage data among the plurality of storage units, the amount of valid data on the storage unit is less than a second specified threshold, or the storage unit contains the least valid data among the plurality of storage units. For ease of description, in this embodiment, the selected storage unit on which garbage collection is performed is referred to as a first storage unit or the storage unit 1.
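A per-storage-unit garbage collection pass of the kind described above can be sketched as follows. The `StorageUnit` class, the metadata dictionary, and the selection of the unit with the least valid data are illustrative stand-ins, not the real structures; the point of the sketch is that moving valid data only requires updating the metadata that describes it, which is why keeping that metadata on a single node matters.

```python
# Sketch: pick the storage unit with the least valid data, move its valid data to a
# new unit, update the describing metadata, and release the reclaimed unit's space.
class StorageUnit:
    def __init__(self, unit_id):
        self.unit_id = unit_id
        self.blocks = {}                 # physical address -> data block
        self.next_addr = 0

    def append(self, block):
        addr = self.next_addr
        self.blocks[addr] = block
        self.next_addr += len(block)
        return addr

def collect_garbage(units, metadata, new_unit):
    """metadata maps virtual address -> (unit_id, physical address)."""
    def valid_count(u):
        return sum(1 for uid, _ in metadata.values() if uid == u.unit_id)
    victim = min(units, key=valid_count)                        # unit with the least valid data
    for vaddr, (uid, paddr) in list(metadata.items()):
        if uid == victim.unit_id:
            new_paddr = new_unit.append(victim.blocks[paddr])   # migrate valid data
            metadata[vaddr] = (new_unit.unit_id, new_paddr)     # update the describing metadata
    victim.blocks.clear()                                       # release the old unit's space
    return new_unit

unit0, unit1 = StorageUnit(0), StorageUnit(1)
a0 = unit0.append(b"valid"); unit0.append(b"stale")             # second block is garbage
meta = {"LUN-7:0x2000": (0, a0)}
collect_garbage([unit0], meta, unit1)
print(meta)   # the metadata now points at unit 1
```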
The following describes, with reference to a flowchart, a node capacity expansion method provided in this embodiment. The method includes the following steps.
S701: Configure a data partition group and a metadata partition group of a first node. The data partition group includes a plurality of data partitions, and the metadata partition group includes a plurality of metadata partitions. Metadata of data corresponding to the configured data partition group is a subset of metadata corresponding to the metadata partition group. The subset herein has two meanings. One is that the metadata corresponding to the metadata partition group includes metadata used to describe the data corresponding to the data partition group. The other one is that a quantity of the metadata partitions included in the metadata partition group is greater than a quantity of the data partitions included in the data partition group. For example, the data partition group includes M data partitions: a data partition 1, a data partition 2, . . . , and a data partition M. The metadata partition group includes N metadata partitions, where N is greater than M, and the metadata partitions are a metadata partition 1, a metadata partition 2, . . . , a metadata partition M, . . . , and a metadata partition N. According to the foregoing description, metadata corresponding to the metadata partition 1 is used to describe data corresponding to the data partition 1, metadata corresponding to the metadata partition 2 is used to describe data corresponding to the data partition 2, and metadata corresponding to the metadata partition M is used to describe data corresponding to the data partition M. Therefore, the metadata partition group includes all metadata used to describe data corresponding to the M data partitions. In addition, the metadata partition group further includes metadata used to describe data corresponding to another data partition group.
The first node described in S701 is the original node described in the capacity expansion part. In addition, it should be noted that the first node may include one or more data partition groups. Similarly, the first node may include one or more metadata partition groups.
S702: When a second node is added to the storage system, split the metadata partition group into at least two metadata partition subgroups. When the first node includes one metadata partition group, this metadata partition group needs to be split into at least two metadata partition subgroups. When the first node includes a plurality of metadata partition groups, it is possible that only some metadata partition groups need to be split, and the remaining metadata partition groups continue to maintain their original metadata partitions. Which metadata partition groups need to be split and how to split the metadata partition groups may be determined based on a metadata partition group layout after capacity expansion and a metadata partition group layout before capacity expansion. The metadata partition group layout after capacity expansion includes a quantity of the metadata partition subgroups configured for each node in the storage system after the second node is added to the storage system, and a quantity of metadata partitions included in the metadata partition subgroup after the second node is added to the storage system. The metadata partition group layout before capacity expansion includes a quantity of the metadata partition groups configured for the first node before the second node is added to the storage system, and a quantity of metadata partitions included in the metadata partition groups before the second node is added to the storage system. For specific implementation, refer to the related descriptions in the capacity expansion section above.
In actual implementation, splitting refers to changing a mapping relationship. Further, before splitting, there is a mapping relationship between an identifier of an original metadata partition group and an identifier of each metadata partition included in the original metadata partition group. After splitting, identifiers of at least two metadata partition subgroups are added, the mapping relationship between the identifier of the metadata partition included in the original metadata partition group and the identifier of the original metadata partition group is deleted, and a mapping relationship between identifiers of some metadata partitions included in the original metadata partition group and an identifier of one of the metadata partition subgroups and a mapping relationship between identifiers of another part of metadata partitions included in the original metadata partition group and an identifier of another metadata partition subgroup are established.
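The "splitting as changing a mapping relationship" described above can be sketched as follows. The dictionary layout and the group identifiers (such as "mdg-0") are hypothetical; the sketch only shows the mapping changes: the old group-to-partition entry is deleted, and two subgroup entries are created.

```python
# Sketch: splitting a metadata partition group by remapping its partitions to two subgroups.
def split_partition_group(group_to_partitions, group_id, subgroup_id_a, subgroup_id_b):
    partitions = sorted(group_to_partitions.pop(group_id))      # delete the old mapping
    half = len(partitions) // 2
    group_to_partitions[subgroup_id_a] = partitions[:half]      # first metadata partition subgroup
    group_to_partitions[subgroup_id_b] = partitions[half:]      # second metadata partition subgroup
    return subgroup_id_a, subgroup_id_b

layout = {"mdg-0": list(range(0, 32))}                          # a 32-partition metadata partition group
split_partition_group(layout, "mdg-0", "mdg-0a", "mdg-0b")
print(layout)   # two 16-partition subgroups
```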
S703: Migrate one metadata partition subgroup and metadata corresponding to the metadata partition subgroup to the second node. The second node is the new node described in the capacity expansion part.
Migrating a partition group refers to changing a homing relationship. Further, migrating the metadata partition subgroup to the second node refers to modifying a correspondence between the metadata partition subgroup and the first node to a correspondence between the metadata partition subgroup and the second node. Metadata migration refers to actual movement of data. Further, migrating the metadata corresponding to the metadata partition subgroup to the second node refers to copying the metadata to the second node and deleting the copy of the metadata retained on the first node.
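The two migration steps (re-homing the subgroup and physically moving the metadata) can be sketched as follows. The in-memory dictionaries standing in for the partition-to-node table and the per-node metadata stores, as well as the identifiers used, are illustrative assumptions.

```python
# Sketch: change the group-to-node correspondence, then copy the metadata and delete the source copy.
def migrate_metadata_subgroup(subgroup_id, group_to_node, first_node_store,
                              second_node_store, second_node_id):
    group_to_node[subgroup_id] = second_node_id                 # change the homing relationship
    for key in [k for k, (gid, _) in first_node_store.items() if gid == subgroup_id]:
        second_node_store[key] = first_node_store.pop(key)      # copy, then delete the source copy

# first_node_store maps a metadata key -> (subgroup id, metadata entry)
group_to_node = {"mdg-0a": 1, "mdg-0b": 1}
first_store = {"LUN-7:0x2000": ("mdg-0b", "entry...")}
second_store = {}
migrate_metadata_subgroup("mdg-0b", group_to_node, first_store, second_store, second_node_id=2)
print(group_to_node, first_store, second_store)
```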
The data partition group and the metadata partition group of the first node are configured in S701, so that metadata of data corresponding to the configured data partition group is a subset of metadata corresponding to the metadata partition group. Therefore, even if the metadata partition group is split into at least two metadata partition subgroups, the metadata of the data corresponding to the data partition group is still a subset of metadata corresponding to one of the metadata partition subgroups. In this case, after one of the metadata partition subgroups and the metadata corresponding to the metadata partition subgroup are migrated to the second node, the data corresponding to the data partition group is still described by metadata stored on one node. This avoids modifying metadata on different nodes when data is modified, especially when garbage collection is performed.
To ensure that during next capacity expansion, the metadata of the data corresponding to the data partition group is still a subset of metadata corresponding to a metadata partition subgroup, S704 may be further performed after S703.
S704: Split the data partition group in the first node into at least two data partition subgroups, where metadata of data corresponding to the data partition subgroup is a subset of the metadata corresponding to the metadata partition subgroup. A definition of splitting herein is the same as that of splitting in S702.
In the node capacity expansion method provided in this embodiment, in the various scenarios to which the method is applicable, neither the data partition group nor the data corresponding to the data partition group needs to be migrated to the second node. If the second node receives a read request, the second node may find a physical address of the to-be-read data based on the metadata stored on the second node, to read the data from the node on which the data is located. Because the data volume of the metadata is far less than the data volume of the data, avoiding migrating the data to the second node greatly saves bandwidth between the nodes.
An embodiment further provides a node capacity expansion apparatus. The apparatus includes a configuration module 801, a splitting module 802, and a migration module 803, and optionally further includes an obtaining module 804.
The configuration module 801 is adapted to configure a data partition group and a metadata partition group of a first node in a storage system. The data partition group includes a plurality of data partitions, the metadata partition group includes a plurality of metadata partitions, and metadata of data corresponding to the data partition group is a subset of metadata corresponding to the metadata partition group. Further, refer to the description of S701 in the foregoing method embodiment.
The splitting module 802 is adapted to, when a second node is added to the storage system, split the metadata partition group into at least two metadata partition subgroups. Further, refer to the description of S702 in the foregoing method embodiment.
The migration module 803 is adapted to migrate one metadata partition subgroup in the at least two metadata partition subgroups and metadata corresponding to the metadata partition subgroup to the second node. Further, refer to the description of S703 in the foregoing method embodiment.
Optionally, the apparatus further includes an obtaining module 804, adapted to obtain a metadata partition group layout after capacity expansion and a metadata partition group layout before capacity expansion. The metadata partition group layout after capacity expansion includes a quantity of the metadata partition subgroups configured for each node in the storage system after the second node is added to the storage system, and a quantity of metadata partitions included in the metadata partition subgroup after the second node is added to the storage system. The metadata partition group layout before capacity expansion includes a quantity of the metadata partition groups configured for the first node before the second node is added to the storage system, and a quantity of metadata partitions included in the metadata partition groups before the second node is added to the storage system. The splitting module 802 is further adapted to split the metadata partition group into at least two metadata partition subgroups based on the metadata partition group layout after capacity expansion and the metadata partition group layout before capacity expansion.
Optionally, the splitting module 802 is further adapted to, after migrating at least one metadata partition subgroup and metadata corresponding to the at least one metadata partition subgroup to the second node, split the data partition group into at least two data partition subgroups. Metadata of data corresponding to the data partition subgroup is a subset of metadata corresponding to the metadata partition subgroup.
Optionally, the configuration module 801 is further adapted to, when the second node is added to the storage system, keep the data corresponding to the data partition group stored on the first node.
An embodiment further provides a storage node. The storage node may be a storage array or a server. When the storage node is a storage array, the storage node includes a storage controller and a storage medium. For a structure of the storage controller, refer to the accompanying schematic structural diagram; the structure includes a processor 901, a memory 902, and a program 903 stored in the memory 902.
The processor 901 is a single-core or multi-core central processing unit, or an application-specific integrated circuit, or may be configured as one or more integrated circuits for implementing this embodiment of the present disclosure. The memory 902 may be a high-speed random-access memory (RAM), or may be a non-volatile memory, for example, at least one hard disk memory. The memory 902 is adapted to store a computer-executable instruction. Further, the computer-executable instruction may include the program 903. When the storage node runs, the processor 901 runs the program 903 to perform the method procedure of S701 to S704 described above.
Functions of the configuration module 801, the splitting module 802, the migration module 803, and the obtaining module 804 described above may be implemented by the processor 901 by running the program 903.
All or some of the foregoing embodiments may be implemented by using software, hardware, firmware, or any combination thereof. When software is used to implement the embodiments, all or some of the embodiments may be implemented in a form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedure or functions according to the embodiments of this disclosure are all or partially generated. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable apparatuses. The computer instructions may be stored on a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from a website, computer, storage node, or data center to another website, computer, storage node, or data center in a wired (for example, a coaxial cable, an optical fiber, or a digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device, such as a storage node or a data center, integrating one or more usable mediums. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a digital versatile disc (DVD)), a semiconductor medium (for example, an SSD), or the like.
It should be understood that, in the embodiments of this disclosure, the term “first” and the like are merely intended to indicate objects, but do not indicate a sequence of corresponding objects.
A person of ordinary skill in the art may be aware that, in combination with the examples described in the embodiments disclosed in this specification, units and algorithm steps may be implemented by electronic hardware or a combination of computer software and electronic hardware. Whether the functions are performed by hardware or software depends on particular applications and design constraint conditions of the technical solutions. A person skilled in the art may use different methods to implement the described functions for each particular application, but it should not be considered that the implementation goes beyond the scope of this disclosure.
It may be clearly understood by a person skilled in the art that, for the purpose of convenient and brief description, for a detailed working process of the foregoing system, apparatus, and unit, refer to a corresponding process in the foregoing method embodiments, and details are not described herein again.
In the several embodiments provided in this disclosure, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other manners. For example, the described apparatus embodiments are merely examples. For example, division into the units is merely logical function division and may be other division in actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communications connections may be implemented by using some interfaces. The indirect couplings or communications connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected based on actual requirements to achieve the objectives of the solutions of the embodiments.
In addition, functional units in the embodiments of this disclosure may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units are integrated into one unit.
When the functions are implemented in the form of a software functional unit and sold or used as an independent product, the functions may be stored on a computer-readable storage medium. Based on such an understanding, the technical solutions of this disclosure essentially, or the part contributing to the other approaches, or some of the technical solutions may be implemented in a form of a software product. The software product is stored on a storage medium, and includes several instructions for instructing a computer device (which may be a personal computer, a storage node, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this disclosure. The foregoing storage medium includes any medium that can store program code, such as a Universal Serial Bus (USB) flash drive, a removable hard disk, a read-only memory (ROM), a RAM, a magnetic disk, or an optical disc.
The foregoing descriptions are merely specific implementations of this disclosure, but are not intended to limit the protection scope of this disclosure. Any variation or replacement readily figured out by a person skilled in the art within the technical scope disclosed in this disclosure shall fall within the protection scope of this disclosure. Therefore, the protection scope of this disclosure shall be subject to the protection scope of the claims.
Foreign application priority data:

Number | Date | Country | Kind
---|---|---|---
201811249893.8 | Oct. 25, 2018 | CN | National
201811571426.7 | Dec. 21, 2018 | CN | National
This application is a continuation of International Patent Application No. PCT/CN2019/111888 filed on Oct. 18, 2019, which claims priority to Chinese Patent Application No. 201811571426.7 filed on Dec. 21, 2018 and Chinese Patent Application No. 201811249893.8 filed on Oct. 25, 2018. All of the aforementioned patent applications are hereby incorporated by reference in their entireties.
Related U.S. application data:

Relationship | Number | Date | Country
---|---|---|---
Parent | PCT/CN2019/111888 | Oct. 18, 2019 | US
Child | 17239194 | | US