The present invention relates to the field of information technologies, and in particular, to a stripe reassembling method in a storage system and a stripe server.
A distributed storage system includes a plurality of storage nodes. When a client receives a write request sent by a host and writes data into the distributed storage system, the data is stored in the corresponding storage nodes in the form of a stripe. For example, according to an erasure coding (EC) algorithm, a quantity of data strips in a stripe is N, a quantity of parity strips in the stripe is M, and a length of the stripe is N+M, where both N and M are positive integers. In the distributed storage system, a stripe with a length of N+M includes N+M strips with a same length, and each strip is distributed on one storage node in the distributed storage system. Therefore, one stripe is distributed on N+M storage nodes.
The client receives a data write request, where the write request is used to modify data of a data strip that has already been stored in a stripe. Generally, the client writes the data of the data write request to a new stripe and marks the data of the data strip in the original stripe as invalid. When a quantity of data strips that store invalid data in a stripe reaches a threshold, the distributed storage system reassembles the stripe, that is, it reassembles valid data of data strips in a plurality of stripes into a new stripe. In this process, valid data needs to be migrated across storage nodes.
This application provides a stripe reassembling method in a storage system and a computer program product, to reduce cross-node migration of valid data in a stripe reassembling process.
According to a first aspect, this application provides a stripe reassembling method in a storage system. The storage system includes N data storage nodes that store data strips and M parity storage nodes that store parity strips. R stripes are distributed on the N+M storage nodes, each stripe Si includes N data strips and M parity strips, a data strip Dix is distributed on an xth storage node in the N data storage nodes, and a parity strip Piy is distributed on a yth storage node in the M parity storage nodes. N, M, and R are positive integers, R is not less than 2, a value of i is an integer ranging from 1 to R, a value of x is an integer ranging from 1 to N, and a value of y is an integer ranging from 1 to M. A stripe server selects the R stripes such that, among the data strips Dix of the R stripes that are distributed on a same data storage node, at most one data strip includes valid data. In this solution, the stripe server generates parity data of a parity strip PKy in a new stripe SK for the data of the data strips including valid data in the R stripes, where K is an integer outside the range 1 to R. The new stripe SK includes the data strips including valid data in the R stripes and the parity strip PKy. The stripe server stores the parity data of the parity strip PKy on the yth storage node in the M parity storage nodes. In the stripe reassembling method, data strips including valid data in a plurality of stripes are used as data strips in a new stripe. In addition, among the data strips Dix of the plurality of stripes that are distributed on a same data storage node, at most one data strip includes valid data. Therefore, the data of these data strips does not need to be migrated across storage nodes during stripe reassembling.
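The selection condition of the first aspect can be illustrated with a small sketch. It assumes, hypothetically, that the stripe server sees the R stripes as a validity matrix in which `valid[i][x]` is True when data strip D(i+1)(x+1) still holds valid data; the function names `can_reassemble` and `compose_new_stripe` are illustrative only. Each column corresponds to one data storage node, so keeping at most one True per column is what lets every valid strip stay on the node it already occupies.

```python
from typing import List, Optional


def can_reassemble(valid: List[List[bool]]) -> bool:
    """True if R >= 2 and at most one stripe holds valid data on each data storage node."""
    if len(valid) < 2:  # the method requires R >= 2
        return False
    n = len(valid[0])
    # Column x collects the data strips D1x..DRx, which all sit on the xth data storage node.
    return all(sum(row[x] for row in valid) <= 1 for x in range(n))


def compose_new_stripe(valid: List[List[bool]]) -> List[Optional[str]]:
    """For each data storage node, pick the single valid strip Dix (None if no strip is valid)."""
    if not can_reassemble(valid):
        raise ValueError("at most one valid data strip per data storage node is allowed")
    members: List[Optional[str]] = []
    for x in range(len(valid[0])):
        owner = next((i for i, row in enumerate(valid) if row[x]), None)
        members.append(None if owner is None else f"D{owner + 1}{x + 1}")
    return members
```

For example, if S1 has valid data only in D13, S2 only in D21, and S3 only in D32, `compose_new_stripe` returns `["D21", "D32", "D13"]`: the new stripe is assembled from strips that never leave their nodes.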
With reference to the first aspect of this application, in a possible implementation, the method further includes: After the stripe server stores the parity data of the parity strip PKy on the yth parity storage node in the M parity storage nodes, the stripe server instructs each data storage node on which a data strip storing garbage data of the R stripes is distributed to perform garbage collection. In another implementation, the data storage node checks the status of the stored data of a data strip, and starts garbage collection when the data of the data strip is determined to be garbage data. In an implementation, when storing the data of the data strip, the data storage node creates a mapping between the host access address of the data and the identifier of the data strip. When the data of the data strip becomes garbage data, this mapping is marked as invalid, so that the data of the data strip is determined to be garbage data. Further, the data storage node may perform garbage collection in the data strip after the stripe server releases the stripe that includes the data strip. In another implementation, the data storage node may determine, based on the amount of garbage data of the data strips stored in the data storage node, whether to start garbage collection. Alternatively, a parity storage node may determine, based on whether the data of a parity strip is garbage data, whether to start garbage collection, or may perform garbage collection as instructed by the stripe server.
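The mapping-based garbage determination described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the product's implementation: the class `DataStorageNode`, its fields and methods, and the simplification that each strip holds the data of exactly one host access address are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class DataStorageNode:
    """Hypothetical data storage node that tracks which strips hold garbage data."""
    # host access address -> identifier of the strip currently holding its data
    address_map: Dict[int, str] = field(default_factory=dict)
    # strips whose address mapping has been marked invalid (their data is garbage)
    garbage_strips: Set[str] = field(default_factory=set)

    def store(self, host_address: int, strip_id: str) -> None:
        old = self.address_map.get(host_address)
        if old is not None and old != strip_id:
            # The host address was written again: the old mapping becomes invalid,
            # so the data of the old strip is now garbage data.
            self.garbage_strips.add(old)
        self.address_map[host_address] = strip_id

    def is_garbage(self, strip_id: str) -> bool:
        return strip_id in self.garbage_strips
```

Writing host address `0x1000` first to strip `D11` and then to strip `D41` marks `D11` as garbage while `D41` stays valid.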
With reference to the first aspect of this application and the foregoing possible implementation of the first aspect, in another possible implementation, the stripe server records a mapping between the new stripe and a strip identifier, where the strip identifier includes an identifier of the data strip storing valid data in the R stripes and an identifier of the parity strip in the new stripe.
With reference to the first aspect of this application and the foregoing possible implementations of the first aspect, in another possible implementation, the stripe server releases the R stripes. After the stripe server releases the R stripes, the R stripes may be reallocated to newly written data.
With reference to the first aspect of this application and the foregoing possible implementations of the first aspect, in another possible implementation, the stripe server determines the stripe Si, where a quantity of data strips including garbage data in the stripe Si meets a reassembly threshold. Specifically, the stripe server determines, based on the quantity of data strips including garbage data in a stripe, whether the stripe reaches the reassembly threshold, and records each stripe that reaches it. In an implementation, the identifiers of the stripes that reach the reassembly threshold may be recorded in a linked list.
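A possible shape of the threshold check is sketched below. The threshold value, the input format (stripe identifier to count of garbage data strips), and the name `collect_candidates` are hypothetical, and "meets the threshold" is assumed to mean "greater than or equal to". A `collections.deque` stands in for the linked list of recorded stripe identifiers.

```python
from collections import deque
from typing import Deque, Dict

REASSEMBLY_THRESHOLD = 2  # hypothetical: garbage data strips per stripe that trigger reassembly


def collect_candidates(garbage_counts: Dict[str, int],
                       threshold: int = REASSEMBLY_THRESHOLD) -> Deque[str]:
    """Record, in order, the identifiers of stripes whose garbage strip count meets the threshold."""
    candidates: Deque[str] = deque()
    for stripe_id, count in garbage_counts.items():
        if count >= threshold:
            candidates.append(stripe_id)
    return candidates
```

With counts `{"S1": 2, "S2": 3, "S3": 2, "S5": 1}`, stripes S1, S2, and S3 are recorded and S5 is not.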
According to a second aspect, this application provides a stripe server. The stripe server is used in a storage system, and the storage system includes N data storage nodes storing data strips and M parity storage nodes storing parity strips. R stripes are distributed on the N+M storage nodes, each stripe Si includes N data strips and M parity strips, a data strip Dix is distributed on an xth storage node in the N data storage nodes, and a parity strip Piy is distributed on a yth storage node in the M parity storage nodes. N, M, and R are positive integers, R is not less than 2, a value of i is an integer ranging from 1 to R, a value of x is an integer ranging from 1 to N, and a value of y is an integer ranging from 1 to M. The stripe server includes an interface and a processor, the interface communicates with the processor, and the processor is configured to perform the first aspect of this application and the possible implementations of the first aspect.
According to a third aspect, this application provides a stripe server. The stripe server is used in a storage system, and the storage system includes N data storage nodes storing data strips and M parity storage nodes storing parity strips. R stripes are distributed on the N+M storage nodes, each stripe Si includes N data strips and M parity strips, a data strip Dix is distributed on an xth storage node in the N data storage nodes, and a parity strip Piy is distributed on a yth storage node in the M parity storage nodes. N, M, and R are positive integers, R is not less than 2, a value of i is an integer ranging from 1 to R, a value of x is an integer ranging from 1 to N, and a value of y is an integer ranging from 1 to M. The stripe server includes corresponding units, and each unit is configured to perform corresponding operations of the first aspect of this application and the possible implementations of the first aspect.
According to a fourth aspect, this application provides a computer program product. The computer program product includes a computer instruction, the computer program product is used in a storage system, and the storage system includes N data storage nodes storing data strips and M parity storage nodes storing parity strips. R stripes are distributed on the N+M storage nodes, each stripe Si includes N data strips and M parity strips, a data strip Dix is distributed on an xth storage node in the N data storage nodes, and a parity strip Piy is distributed on a yth storage node in the M parity storage nodes. N, M, and R are positive integers, R is not less than 2, a value of i is an integer ranging from 1 to R, a value of x is an integer ranging from 1 to N, and a value of y is an integer ranging from 1 to M. A stripe server that is used in the storage system executes the computer instruction, to perform the first aspect of this application and the possible implementations of the first aspect.
A storage system in the embodiments of the present invention may be a distributed storage system, for example, FusionStorage® series or OceanStor® 9000 series of Huawei®. For example, as shown in
The server in the distributed storage system is in a structure shown in
A client in the distributed storage system writes data into the distributed storage system based on a write request of a host, or reads data from the distributed storage system based on a read request of the host. The server in the embodiments of the present invention may be used as a client. In addition, the client may also be a device independent of the server shown in
In the embodiments of the present invention, the distributed block storage system is used as an example. A client provides a block protocol access interface, so that the client provides a distributed block storage access point service. A host may access a storage resource in a storage resource pool in the distributed block storage system by using the client. Generally, the block protocol access interface is configured to provide a logical unit for the host. A server runs a program of the distributed block storage system, so that a server that includes a hard disk is used as a storage node to store data of the client. For example, one hard disk may be used as one storage node by default in the server. When the server includes a plurality of hard disks, one server may be used as a plurality of storage nodes. In another implementation, the server runs the distributed block storage system program to serve as one storage node. This is not limited in the embodiments of the present invention. Therefore, for a structure of the storage node, refer to
Based on a reliability requirement of the distributed block storage system, data reliability may be improved by using an erasure coding (EC) algorithm. For example, in a 3+1 mode, a stripe includes three data strips and one parity strip. In the embodiments of the present invention, data is stored in a partition in the form of stripes, and one partition includes R stripes Si, where i is an integer ranging from 1 to R. In the embodiments of the present invention, the partition P2 is used as an example for description.
It should be noted that a stripe is logical storage space, the stripe includes a plurality of strips, and a strip is also referred to as a strip unit. “Data of a strip” is content stored in the strip.
The strips in a stripe include data strips and parity strips. A strip used to store data (depending on the application scenario, the data may be user data, service data, or application data) is referred to as a data strip, and a strip used to store parity data is referred to as a parity strip. There is a parity relationship between the data and the parity data. According to the EC algorithm, when the content of some strips is faulty or cannot be read, the content of the other strips may be used to restore it. In the embodiments of the present invention, a storage node that stores data of a data strip is referred to as a data storage node, a storage node that stores parity data of a parity strip is referred to as a parity storage node, and the data storage node and the parity storage node are collectively referred to as storage nodes.
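The parity relationship and restore step can be illustrated for the 3+1 mode. The sketch assumes byte-wise XOR as the EC algorithm, which is the simplest code with a single parity strip (M = 1); the embodiments do not name a specific EC algorithm, and layouts with M > 1 typically use codes such as Reed-Solomon instead. The function names are hypothetical.

```python
from functools import reduce
from typing import List


def xor_parity(strips: List[bytes]) -> bytes:
    """Byte-wise XOR of equal-length strips (single-parity code, M = 1)."""
    if len({len(s) for s in strips}) != 1:
        raise ValueError("all strips in a stripe have the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))


def recover(surviving: List[bytes]) -> bytes:
    """Rebuild the one unreadable strip as the XOR of all surviving strips (data and parity)."""
    return xor_parity(surviving)
```

With data strips `b"\x01\x02"`, `b"\x10\x20"`, and `b"\x0f\x0f"`, the parity strip is `b"\x1e\x2d"`, and XOR-ing the first data strip, the third data strip, and the parity strip restores the second data strip.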
In the distributed block storage system, the hard disk is managed in slices of 8 kilobytes (KB), and allocation information of each 8 KB slice is recorded in a metadata management area of the hard disk. Slices of the hard disk form a storage resource pool. The distributed block storage system includes a stripe server. In one implementation, a stripe management program runs on one or more servers in the distributed block storage system. The stripe server allocates a stripe to a partition. For a structure of the stripe server, refer to the server structure shown in
To reduce the quantity of strip identifiers managed by the stripe server, the stripe server allocates a version number to each strip identifier in a stripe. After a stripe is released, the version number of the strip identifier of each strip in the released stripe is updated, so that the strip identifier with the updated version number can be used as the strip identifier of a strip in a new stripe. The stripe server pre-allocates strips to the stripe Si, so that waiting time can be reduced when the client writes data, thereby improving the write performance of the distributed block storage system. In the embodiments of the present invention, each strip in the stripe Si has a unique identifier in the distributed block storage system.
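One way to picture version-numbered strip identifiers is sketched below. The split of an identifier into a reusable `base` number and a `version`, and the names `StripId` and `release_stripe`, are assumptions made for illustration only; the embodiments do not specify the identifier layout.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StripId:
    """Hypothetical strip identifier: a reusable base number plus a version number."""
    base: int
    version: int = 0

    def next_version(self) -> "StripId":
        # Same base number, new version: the identifier can name a strip in a new
        # stripe without the stripe server managing an additional base number.
        return StripId(self.base, self.version + 1)


def release_stripe(strip_ids: List[StripId]) -> List[StripId]:
    """On stripe release, bump the version of every strip identifier so it can be reused."""
    return [sid.next_version() for sid in strip_ids]
```

Because the dataclass is frozen and compares all fields, `StripId(7, 1)` and `StripId(7, 0)` are distinct identifiers even though they share the base number 7.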
In the embodiments of the present invention, an example in which a stripe length is 4 is used. The stripe includes three data strips and one parity strip. Distribution of stripes on storage nodes is shown in
In an embodiment of the present invention shown in
The selected stripes are to-be-reassembled stripes, and in the to-be-reassembled stripes, among the data strips distributed on the same data storage node, at most one data strip includes valid data. For example, stripes S1, S2, and S3 have the same strip distribution, so their data strips form three groups, each located on one of three data storage nodes. The first group includes D11, D21, and D31, the second group includes D12, D22, and D32, and the third group includes D13, D23, and D33. As shown in
In this embodiment of the present invention, valid data is data whose host access address has not been written again since the data was written into a data storage node.
Step 702: Use the data strips (D21, D32, and D13) including valid data in the stripes S1, S2, and S3 as data strips in a new stripe S4, and generate parity data of a parity strip P41 from the data of the data strips in S4 by using the same EC algorithm as that used for S1, S2, and S3.
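Steps 702 and 703 can be walked through with a toy example. It assumes 4-byte strips with made-up illustrative contents and XOR as the single-parity EC algorithm (the embodiments do not fix a specific algorithm); the names `strips`, `s4_data`, and `stripe_map` are hypothetical. The point it shows is that only the parity strip P41 is newly computed and written, while D21, D32, and D13 are referenced where they already reside.

```python
from functools import reduce
from typing import Dict, List

# Illustrative contents of the three strips that still hold valid data.
strips: Dict[str, bytes] = {
    "D21": b"\x11\x22\x33\x44",  # stripe S2, first data storage node
    "D32": b"\x55\x66\x77\x88",  # stripe S3, second data storage node
    "D13": b"\x99\xaa\xbb\xcc",  # stripe S1, third data storage node
}


def xor_parity(data: List[bytes]) -> bytes:
    """Byte-wise XOR parity over equal-length strips (assumed single-parity EC)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*data))


# Step 702: reuse the valid strips in place, ordered by the data storage node they occupy.
s4_data = ["D21", "D32", "D13"]
p41 = xor_parity([strips[s] for s in s4_data])

# Step 703 writes p41 to parity storage node N1; the stripe server records S4's members.
stripe_map: Dict[str, List[str]] = {"S4": s4_data + ["P41"]}
```

Here `p41` evaluates to `b"\xdd\xee\xff\x00"`, and XOR-ing it with D32 and D13 yields D21 again, so S4 is protected without moving any data strip.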
Step 703: Store the parity data of the parity strip P41 on a parity storage node N1.
The stripe server records a mapping between the stripe S4 and the strips D21, D32, D13, and P41, so that the strips D21, D32, D13, and P41 become the strips of the stripe S4.
In the embodiment shown in
Further, the stripe server instructs the storage nodes on which data strips storing garbage data are distributed to perform garbage collection. In this embodiment of the present invention, the stripe server instructs the data storage nodes that store data of the data strips D11, D12, D22, D23, D31, and D33 to perform garbage collection. In addition, the parity data stored in P11, P21, and P31 is no longer needed, because the stripes S1, S2, and S3 that it protected are replaced by the stripe S4. Therefore, the stripe server further instructs the parity storage node that stores the parity data of the parity strips P11, P21, and P31 to perform garbage collection. After a storage node performs garbage collection, the storage space occupied by garbage data on the storage node is released. In another implementation, a storage node performs garbage collection based on the status of the stored data of a strip. For example, a data storage node checks the status of the stored data of a data strip, and starts garbage collection when the data of the data strip is determined to be garbage data. In an implementation, when storing the data of the data strip, the data storage node creates a mapping between the host access address of the data and the identifier of the data strip. When the data of the data strip becomes garbage data, this mapping is marked as invalid, so that the data of the data strip is determined to be garbage data. Further, the data storage node may perform garbage collection in the data strip after the stripe server releases the stripe that includes the data strip. In another implementation, the data storage node may determine, based on the amount of garbage data in the data strips it stores, whether to start garbage collection. For example, the data storage node stores 1000 data strips in total, of which 600 store garbage data. If the threshold for starting garbage collection on the data storage node is 60%, the data storage node starts garbage collection when the proportion of data strips storing garbage data reaches 60%; here 600/1000 = 60%, so garbage collection starts. Therefore, each data storage node in a distributed system may independently perform garbage collection in data strips. This embodiment of the present invention is also applicable to garbage collection by a parity storage node that stores parity strips. In this embodiment of the present invention, after a stripe is reassembled with other stripes, the data of the parity strip in the original stripe becomes garbage data, and the garbage collection described in this embodiment of the present invention also needs to be performed on it. For an implementation, refer to garbage collection in a data strip. In this way, through stripe reassembling, a plurality of storage nodes may each independently perform garbage collection in strips.
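The threshold arithmetic in the example above can be written as a one-line check. The function name `should_start_gc` and the integer-percentage parameter are hypothetical; integer arithmetic is used so that a ratio landing exactly on the threshold (600 of 1000 at 60%) is not affected by floating-point rounding.

```python
def should_start_gc(total_strips: int, garbage_strips: int, threshold_percent: int = 60) -> bool:
    """Start garbage collection when the share of strips holding garbage data reaches the threshold."""
    if total_strips <= 0:
        return False
    # Compare garbage/total >= threshold/100 without floating-point division.
    return garbage_strips * 100 >= threshold_percent * total_strips
```

With 1000 stored data strips, 600 garbage strips trigger garbage collection and 599 do not.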
The stripe server releases the stripes S1, S2, and S3. Releasing the stripes S1, S2, and S3 includes: The stripe server sets the stripes S1, S2, and S3 to an idle state, so that new data can subsequently be allocated to them.
According to the stripe reassembling method provided in this embodiment of the present invention, valid data does not need to be migrated across storage nodes, which improves the stripe reassembling performance of the storage system.
In this embodiment of the present invention, in the selected to-be-reassembled stripes, among the data strips distributed on the same data storage node, at most one data strip includes valid data. In one implementation, among the data strips of the to-be-reassembled stripes that are distributed on a particular data storage node, none includes valid data. In this scenario, the reassembled stripe includes an idle data strip at that position: because no data strip at that position includes valid data, the content of the data strip at that position in the new stripe is empty. In another implementation of this embodiment of the present invention, stripe reassembling may also be performed on two stripes, that is, R = 2.
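The idle-position case can be sketched for R = 2 stripes with three data storage nodes, where no stripe holds valid data on the third node. The sketch assumes XOR single parity and assumes that an empty data strip contributes all-zero bytes to the parity computation; both are illustrative choices rather than details stated in the embodiments, and `parity_with_idle` is a hypothetical name.

```python
from functools import reduce
from typing import List, Optional


def parity_with_idle(members: List[Optional[bytes]], strip_len: int) -> bytes:
    """XOR parity in which an idle (empty) data strip position is treated as all zeros."""
    filled = [m if m is not None else bytes(strip_len) for m in members]
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*filled))
```

For members `[b"\x0f\x0f", b"\xf0\x00", None]` the parity is `b"\xff\x0f"`, the same as the parity of the two non-empty strips alone, because XOR with zero bytes leaves the result unchanged.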
Based on the foregoing description, an embodiment of the present invention further provides a stripe server, used in a storage system in the embodiments of the present invention, for example, a distributed storage system. The storage system includes N data storage nodes that store data strips and M parity storage nodes that store parity strips. R stripes are distributed on the N+M storage nodes, each stripe Si includes N data strips and M parity strips, a data strip Dix is distributed on an xth storage node in the N data storage nodes, and a parity strip Piy is distributed on a yth storage node in the M parity storage nodes. N, M, and R are positive integers, R is not less than 2, a value of i is an integer ranging from 1 to R, a value of x is an integer ranging from 1 to N, and a value of y is an integer ranging from 1 to M. As shown in
Further, the stripe server shown in
For an implementation of the stripe server shown in
Correspondingly, an embodiment of the present invention further provides a computer-readable storage medium and a computer program product. The computer-readable storage medium and the computer program product include a computer instruction, to implement various solutions described in the embodiments of the present invention.
In the embodiments of the present invention, identifiers used to describe stripes, data strips, parity strips, and storage nodes are merely used to describe the embodiments of the present invention more clearly. In actual product implementation, similar identifiers are not necessarily required. Therefore, in the embodiments of the present invention, the identifiers used to describe the stripes, the data strips, the parity strips, and the storage nodes are not intended to limit the present invention.
In this embodiment of the present invention, the statement that the stripe Si includes N data strips and M parity strips, that a data strip Dix is distributed on an xth data storage node in the N data storage nodes, and that a parity strip Piy is distributed on a yth parity storage node in the M parity storage nodes means the following: the xth data storage node in the N data storage nodes provides a storage address for the data strip Dix, and data of the data strip Dix is stored at that storage address; the yth parity storage node in the M parity storage nodes provides a storage address for the parity strip Piy, and parity data of the parity strip Piy is stored at that storage address. This is also described as the stripe Si being distributed on the N+M storage nodes.
In the several embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, division into the units in the described apparatus embodiment is merely logical function division and another division may be used in actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented through some interfaces. The indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual requirements to achieve the objectives of the solutions of the embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units are integrated into one unit.
Number | Date | Country | Kind |
---|---|---|---|
201811589151.X | Dec 2018 | CN | national |
This application is a continuation of International Application No. PCT/CN2019/103281, filed on Aug. 29, 2019, which claims priority to Chinese Patent Application No. 201811589151.X, filed on Dec. 25, 2018, the disclosures of which are hereby incorporated by reference in their entirety.
Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/CN2019/103281 | Aug 2019 | US |
Child | 17356849 | | US |