The present invention relates to the field of information technologies, and in particular, to a data storage method and apparatus in a distributed storage system, and a computer program product.
A distributed storage system includes a storage node and a stripe, each stripe includes a plurality of strips, and a storage node corresponds to a strip in the stripe, that is, a storage node provides storage space for a strip in the stripe. Generally, as shown in
This application provides a data storage method and apparatus in a distributed storage system, and a computer program product, without requiring a primary storage node, so that under the premise of ensuring data consistency, data interaction between storage nodes is reduced, and write performance of the distributed storage system is improved.
A first aspect of this application provides a data storage method in a distributed storage system. The distributed storage system includes M storage nodes Nj, and j is an integer ranging from 1 to M. In this method, the storage node Nj receives data of a strip SUNj in a stripe SN, where the stripe SN includes M strips SUNj; the storage node Nj receives data of a strip SUKj in a stripe SK, where a logical address of the data of the strip SUKj is the same as a logical address of the data of the strip SUNj, N is different from K, and the stripe SK includes M strips SUKj; and the storage node Nj generates a record, where the record is used to indicate that the data of the strip SUNj reaches the storage node Nj before the data of the strip SUKj. In this data storage method, the storage node Nj receives the data of the strip SUNj and the data of the strip SUKj directly from a client and indicates, by using the record, that the data of the strip SUNj reaches the storage node Nj before the data of the strip SUKj, so that when the storage node Nj stores a plurality of pieces of data of the same logical address, data of a latest version may be determined based on the record. In this data storage method, a primary storage node is not required, and therefore, data interaction between storage nodes is reduced, and write performance of the distributed storage system is improved.
In implementation, the record in the first aspect of this application may indicate, based on an identifier of the strip SUNj and an identifier of the strip SUKj, that the data of the strip SUNj reaches the storage node Nj before the data of the strip SUKj. The data of the strip SUNj and the data of the strip SUKj may be sent by a same client or by different clients.
With reference to the first aspect of this application, in a possible implementation, the storage node Nj backs up the record to one or more other storage nodes in the M storage nodes, to improve reliability of the record. When the storage node Nj becomes faulty, the record is obtained from a storage node in which the record is backed up, and between the restored data of the strip SUNj and the restored data of the strip SUKj, the data of the strip SUKj is determined as latest data based on the record.
With reference to the first aspect of this application and the foregoing possible implementations, in a possible implementation, the storage node Nj receives a read request sent by a third client, where the read request includes the logical address; the storage node Nj determines, by querying the record based on the logical address, that the data of the strip SUKj is latest data; and the storage node Nj returns the data of SUKj to the third client.
With reference to the first aspect of this application and the foregoing possible implementations, in a possible implementation, the strip SUNj and the strip SUKj are allocated from the storage node Nj by a stripe metadata server.
A second aspect of this application provides a storage node in a distributed storage system. The distributed storage system includes M storage nodes Nj, where j is an integer ranging from 1 to M. The storage node serves as the storage node Nj, and includes a plurality of units that are configured to implement the first aspect of this application and various implementations of the first aspect.
A third aspect of this application provides a storage node in a distributed storage system. The distributed storage system includes M storage nodes Nj, where j is an integer ranging from 1 to M. The storage node serves as the storage node Nj, and includes an interface and a processor that communicate with each other. The processor is configured to implement the first aspect of this application and various implementations of the first aspect.
A fourth aspect of this application provides a computer program product. The computer program product includes a computer instruction, and the computer instruction is used by a storage node in a distributed storage system. The distributed storage system includes M storage nodes Nj, where j is an integer ranging from 1 to M. The storage node serves as the storage node Nj, and runs the computer instruction to implement the first aspect of this application and various implementations of the first aspect.
A fifth aspect of this application provides a data storage method in a distributed storage system. The distributed storage system includes M storage nodes Nj, where j is an integer ranging from 1 to M. The method includes: a first client receives a first write request, where the first write request includes first data and a logical address; the first client divides the first data into data of one or more strips SUNj in a stripe SN, where the stripe SN includes M strips SUNj; the first client sends the data of the one or more strips SUNj to the storage node Nj; a second client receives a second write request, where the second write request includes second data and the logical address; the second client divides the second data into data of one or more strips SUKj of a stripe SK, where N is different from K, and the stripe SK includes M strips SUKj; and the second client sends the data of the one or more strips SUKj to the storage node Nj.
With reference to the fifth aspect of this application, in a possible implementation, the strip SUNj and the strip SUKj are allocated from the storage node Nj by a stripe metadata server.
With reference to the fifth aspect of this application, in a possible implementation, each piece of the data of the one or more strips SUNj further includes data strip status information, and the data strip status information is used to indicate whether each data strip in the stripe SN is empty.
A distributed storage system in an embodiment of the present invention includes a Huawei® Fusionstorage® series and an OceanStor® 9000 series. For example, as shown in
The server in the distributed storage system includes a structure shown in
A client in the distributed storage system writes data into the distributed storage system based on a write request from a host or reads data from the distributed storage system based on a read request from a host. The server shown in this embodiment of the present invention may serve as the client. In addition, the client may alternatively be a device independent of the server shown in
In an embodiment of the present invention, a distributed block storage system is used as an example. A client provides a block protocol access interface, so that the client provides a distributed block storage access point service. A host may access a storage resource in a storage resource pool in the distributed block storage system by using the client. Usually, the block protocol access interface is configured to provide a LUN for the host. When a distributed block storage system program runs on a server including a hard disk, the server serves as a storage node to store data received by the client. For example, for the server, one hard disk may serve as one storage node by default, in other words, when the server includes a plurality of hard disks, the server may serve as a plurality of storage nodes. In another implementation, when a distributed block storage system program runs on a server, the server serves as one storage node. This is not limited in this embodiment of the present invention. Therefore, for a structure of the storage node, refer to
Based on a reliability requirement of the distributed block storage system, data reliability may be improved by using an erasure coding (Erasure Coding, EC) algorithm. For example, a 3+1 mode is used, three data strips and one check strip constitute a stripe. In this embodiment of the present invention, data is stored in the partition in a form of a stripe. One partition includes R stripes Si, where i is an integer ranging from 1 to R. In this embodiment of the present invention, P2 is used as an example for description.
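The 3+1 mode can be illustrated with a minimal sketch. The embodiment does not name the specific EC code, so byte-wise XOR parity is assumed here purely for illustration; the function name xor_strips and the sample bytes are hypothetical.

```python
from functools import reduce


def xor_strips(strips):
    """Byte-wise XOR of equally sized strips (single-parity assumption)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))


# Three data strips and one check strip constitute a stripe (3+1 mode).
data_strips = [b"\x01\x02", b"\x10\x20", b"\x0f\x0f"]
check_strip = xor_strips(data_strips)

# Any single lost strip can be rebuilt from the other three strips.
rebuilt_second = xor_strips([data_strips[0], data_strips[2], check_strip])
```

With single parity, the check strip is the XOR of the data strips, and XOR-ing the three surviving strips yields the missing one. A production system may use a different code, for example Reed-Solomon.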
In the distributed block storage system, fragment management is performed on the hard disk in a unit of 8 kilobytes (KB), and allocation information of each 8 KB fragment is recorded in a metadata management area of the hard disk. The fragments of the hard disk constitute the storage resource pool. The distributed block storage system includes a stripe metadata server. In one implementation, a stripe management program runs on one or more servers in the distributed block storage system. The stripe metadata server allocates a stripe to the partition. The partition view shown in
To reduce a quantity of strip identifiers managed by the stripe metadata server, the stripe metadata server allocates a version number to an identifier of a strip in a stripe. After a stripe is released, a version number of a strip identifier of a strip in the released stripe is updated, so that the strip identifier is used as a strip identifier of a strip in a new stripe. The stripe metadata server pre-allocates a strip SUij to a stripe Si, so that waiting time can be reduced when the client writes data, thereby improving write performance of the distributed block storage system. In this embodiment of the present invention, the strip SUij in the stripe Si has a unique identifier in the distributed block storage system.
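The reuse of strip identifiers through version numbers can be sketched as follows. The class names StripId and StripIdAllocator, and the notion of a numbered slot, are hypothetical and are introduced only to illustrate the idea; the embodiment does not prescribe a data layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StripId:
    slot: int      # reusable identifier managed by the stripe metadata server
    version: int   # updated each time the owning stripe is released


class StripIdAllocator:
    """Hypothetical helper: one version counter per reusable strip identifier."""

    def __init__(self, slots):
        self._versions = {slot: 0 for slot in range(slots)}

    def current(self, slot):
        return StripId(slot, self._versions[slot])

    def release(self, slot):
        # Releasing the stripe bumps the version, so the same slot can be
        # handed out again as the identifier of a strip in a new stripe.
        self._versions[slot] += 1
        return self.current(slot)
```

Because the pair of slot and version differs before and after a release, a strip in a new stripe remains distinguishable from the strip that previously used the same slot.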
In this embodiment of the present invention, a logical unit allocated by the distributed block storage system is connected to the client, to perform a data access operation. The logical unit is also referred to as a logical unit number (LUN). In the distributed block storage system, one logical unit may be connected to only one client, or one logical unit may be connected to a plurality of clients, that is, a plurality of clients share one logical unit. The logical unit is provided by a storage resource pool shown in
In an embodiment of the present invention, as shown in
Step 601: A first client receives a first write request.
The first write request includes first data and a logical address. In a distributed block storage system, the distributed block storage system provides an LBA of a LUN. The logical address is used to indicate a writing position of the first data in the LUN.
Step 602: The first client determines that the logical address is distributed in a partition P.
In this embodiment of the present invention, a partition P2 is used as an example. With reference to
Step 603: The first client obtains a stripe SN from R stripes, where N is an integer ranging from 1 to R.
A stripe metadata server manages a correspondence between a partition and a stripe and a relationship between a strip in a stripe and a storage node. An implementation in which the first client obtains the stripe SN from the R stripes is as follows: The first client determines that the logical address is distributed in the partition P2, and the first client queries the stripe metadata server to obtain the stripe SN in the R stripes included in the partition P2. The logical address is an address used for writing data by the client in the distributed block storage system. Therefore, that the logical address is distributed in the partition P represents a same meaning as that the first data is distributed in the partition P. Another implementation in which the first client obtains the stripe SN from the R stripes may be as follows: The first client obtains the stripe SN from a stripe in the R stripes that is allocated to the first client. A client may store a mapping relationship between a partition and a stripe. The client may cache a stripe allocated by the stripe metadata server.
Step 604: The first client divides the first data into data of one or more strips SUNj in the stripe SN.
The stripe SN includes a plurality of strips. The first client receives the first write request, caches the first data included in the first write request, and divides the cached data based on a size of the strip in the stripe. For example, the first client divides the data based on a length of the strip in the stripe to obtain data of a strip size, performs a modulo operation on a quantity M (for example, 4) of storage nodes in the partition based on the logical address of the data of the strip size, to determine a location of the data of the strip size in the stripe, namely, a corresponding strip SUNj, and further determines, based on the partition view, a storage node Nj corresponding to the strip SUNj. Therefore, data of strips with a same logical address is distributed in a same storage node. For example, the first data is divided into data of one or more strips SUNj. In this embodiment of the present invention, P2 is used as an example. With reference to
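The division and the modulo operation described for step 604 can be sketched as follows. The exact formula is not given in the text, so the mapping below (strip-sized block number modulo the modulus, converted to a 1-based index j) is an assumption, as are the function names and the strip size of 8 bytes.

```python
def split_into_strips(data, strip_size):
    """Divide cached data into pieces of the strip size."""
    return [data[i:i + strip_size] for i in range(0, len(data), strip_size)]


def strip_index(logical_address, strip_size, modulus):
    """Assumed mapping: block number modulo `modulus`, returned as 1-based j."""
    return (logical_address // strip_size) % modulus + 1


STRIP_SIZE = 8  # illustrative value only
M = 4           # quantity of storage nodes in the partition, as in the text

pieces = split_into_strips(b"abcdefghijklmnop", STRIP_SIZE)
placement = [strip_index(k * STRIP_SIZE, STRIP_SIZE, M) for k in range(len(pieces))]
```

Because the index depends only on the logical address, two writes to the same logical address land on the same storage node Nj, which is the property the storage-node record relies on.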
In this embodiment of the present invention, the stripe SN includes four strips, namely, three data strips and one check strip. When the first client caches data for more than a period of time, the first client needs to write the data into the storage node. When the data cannot fill the data stripe, for example, there is only the data of the strip SUN1 and the data of SUN2 that are obtained through division of the first data, the check strip is generated based on the data of SUN1 and the data of SUN2. Optionally, data of a valid data strip SUNj includes data strip status information of the stripe SN, and the valid data strip SUNj is a strip that is not empty. In this embodiment of the present invention, the data of the valid data strip SUN1 and the data of the valid data strip SUN2 both include the data strip status information of the stripe SN, and the data strip status information is used to indicate whether each data strip in the stripe SN is empty. For example, 1 is used to indicate that the data strip is not empty, and 0 is used to indicate that the data strip is empty. In this case, the data strip status information included in the data of SUN1 is 110, and the data strip status information included in the data of SUN2 is 110, which indicates that SUN1 is not empty, SUN2 is not empty, and SUN3 is empty. The data of the check strip SUN4 generated based on the data of SUN1 and the data of SUN2 includes check data of the data strip status information. Because SUN3 is empty, the first client does not need to replace the data of SUN3 with all-0 data and write the all-0 data into the storage node N3, thereby reducing an amount of written data. When reading the stripe SN, the first client determines, based on the data strip status information of the stripe SN included in the data of the data strip SUN1 or the data of the data strip SUN2, that SUN3 is empty.
When SUN3 is not empty, the data strip status information included in the data of SUN1, the data of SUN2, and the data of SUN3 in this embodiment of the present invention is 111, and the data of the check strip SUN4 generated based on the data of SUN1, the data of SUN2, and the data of SUN3 includes the check data of the data strip status information.
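The data strip status information can be sketched as a bit string with one position per data strip. The function names below are hypothetical, and representing an empty strip as None is an assumption made for the example.

```python
def status_info(data_strips):
    """Build status bits: 1 means the data strip is not empty, 0 means empty."""
    return "".join("0" if strip is None else "1" for strip in data_strips)


def empty_strip_positions(status):
    """Return the 1-based positions of data strips marked empty."""
    return [i + 1 for i, bit in enumerate(status) if bit == "0"]


# SUN1 and SUN2 hold data and SUN3 is empty, as in the example above.
status = status_info([b"data-1", b"data-2", None])
```

A reader holding the data of SUN1 or SUN2 can call empty_strip_positions on the embedded status to learn that SUN3 is empty without contacting the storage node N3.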
In this embodiment of the present invention, the data of the data strip SUNj further includes metadata, such as an identifier of the data strip SUNj and a logical address of the data of the data strip SUNj.
Step 605: The first client sends the data of the one or more strips SUNj to the storage node Nj.
In this embodiment of the present invention, the first client sends the data of SUN1 obtained through division of the first data to a storage node N1, and sends the data of SUN2 obtained through division of the first data to a storage node N2. The first client may concurrently send the data of the strip SUNj in the stripe SN to the storage node Nj, without requiring a primary storage node, so that data interaction between storage nodes is reduced, write concurrency is improved, and write performance of the distributed block storage system is improved.
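The concurrent sending in step 605 can be sketched with a thread pool. The class InMemoryNode and the function send_strips are hypothetical stand-ins for real storage nodes and network calls; only the idea of sending each strip directly to its own node, with no primary storage node in between, comes from the text.

```python
from concurrent.futures import ThreadPoolExecutor


class InMemoryNode:
    """Hypothetical stand-in for a storage node Nj."""

    def __init__(self):
        self.received = []

    def write(self, strip_id, payload):
        self.received.append((strip_id, payload))
        return strip_id  # acknowledgement


def send_strips(nodes, strips):
    """Send each strip to its own node concurrently; strips maps j -> (id, data)."""
    with ThreadPoolExecutor(max_workers=len(strips)) as pool:
        futures = {j: pool.submit(nodes[j].write, sid, payload)
                   for j, (sid, payload) in strips.items()}
        return {j: future.result() for j, future in futures.items()}


nodes = {j: InMemoryNode() for j in range(1, 5)}
acks = send_strips(nodes, {1: ("SUN1", b"a"), 2: ("SUN2", b"b"), 4: ("SUN4", b"p")})
```

In this sketch nothing is sent to the node for the empty strip SUN3, which matches the reduced write volume described for the status information.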
Further, when only the first client writes data into the LUN provided by the distributed block storage system, the first client receives a second write request. The second write request includes second data and the logical address described in
In another implementation, when a plurality of clients such as the first client and a second client write data into the LUN provided by the distributed block storage system, the second client may perform the foregoing operation of the second write request performed by the first client.
Further, when one logical unit is connected to a plurality of clients such as the first client and the second client, the second client receives a third write request, and the third write request includes third data and the logical address described in
In the embodiment shown in
Corresponding to the embodiment of the first client shown in
Step 801: The storage node Nj stores data of a strip SUNj in a stripe SN.
With reference to the embodiment shown in
Step 802: The storage node Nj stores data of a strip SUKj in a stripe SK.
With reference to the embodiment shown in
Step 803: The storage node Nj generates a record, where the record is used to indicate that the data of the strip SUNj reaches the storage node Nj before the data of the strip SUKj.
In an implementation of this embodiment of the present invention, the record indicates, based on an identifier of the strip SUNj and an identifier of the strip SUKj, that the data of the strip SUNj reaches the storage node Nj before the data of the strip SUKj, so that data consistency is ensured based on the record. That is, the identifier of the strip SUNj and the identifier of the strip SUKj are used as an entry. In the entry, a sequence of the data of the two strips in reaching the storage node Nj may be represented by a sequence of the identifiers of the two strips. In another implementation, in the entry, a sequence of the data of the two strips in reaching the storage node Nj may alternatively be represented by a combination of the identifiers of the two strips and other symbols.
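One way to represent such an entry is an ordered sequence of strip identifiers per logical address. The following is a minimal sketch; the class ArrivalRecord and its method names are hypothetical, and the text leaves the concrete encoding open.

```python
class ArrivalRecord:
    """Hypothetical record: strip identifiers kept in order of arrival."""

    def __init__(self):
        self._order = {}  # logical address -> strip identifiers, oldest first

    def note_arrival(self, logical_address, strip_id):
        self._order.setdefault(logical_address, []).append(strip_id)

    def entry(self, logical_address):
        # The sequence of identifiers represents the arrival sequence.
        return tuple(self._order.get(logical_address, ()))

    def latest(self, logical_address):
        ids = self._order.get(logical_address)
        return ids[-1] if ids else None
```

For example, after note_arrival(0x1000, "SUN1") followed by note_arrival(0x1000, "SUK1"), the entry is ("SUN1", "SUK1") and the latest identifier is "SUK1".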
In this embodiment of the present invention, the storage node Nj backs up the record to one or more other storage nodes in M storage nodes. For example, using the storage node N1 as an example, the record may be backed up to one or more of the storage node N2, the storage node N3, and the storage node N4, so that reliability of the record can be improved, and the record is prevented from being lost when the storage node N1 becomes faulty. The storage node Nj stores an identifier of the storage node in which the record is backed up. However, when the storage node Nj becomes faulty, the stored identifier of the storage node in which the record is backed up is also lost, so the backed-up record cannot be located, and it cannot be determined that, between the restored data of the strip SUNj and the restored data of the strip SUKj, the data of the strip SUKj is latest data. To prevent this, in another implementation, a backup relationship of the record between the M storage nodes may alternatively be determined by a stripe metadata server.
In this embodiment of the present invention, the storage node Nj further records a mapping between the logical address of the data of the strip SUNj and the identifier of the strip SUNj, and a mapping between the logical address of the data of the strip SUKj and the identifier of the strip SUKj. For ease of querying the record stored by the storage node Nj, the logical address of the data of the strip may be used as an index to organize the record. The storage node Nj stores the data of the strip SUNj and records a mapping between the identifier of the strip SUNj and an address for storing the data of the strip SUNj. The storage node Nj stores the data of the strip SUKj and records a mapping between the identifier of the strip SUKj and an address for storing the data of the strip SUKj. In this embodiment of the present invention, for a strip storing check data, namely, a check strip, the storage node needs to record a mapping between an identifier of the check strip and an address for storing the data of the check strip, and does not need to record a mapping between a logical address of the data of the check strip and the identifier of the check strip, because the data of the check strip has no logical address.
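A backup relationship that is determined centrally can be sketched as a deterministic function of the node index. The ring placement used below is purely an assumption for illustration; the text does not specify how the backup nodes are chosen, and the function name backup_targets is hypothetical.

```python
def backup_targets(j, m, copies=1):
    """Return the 1-based indices of the nodes that back up the record of Nj.

    Assumed policy: the next `copies` nodes in a ring of m storage nodes.
    """
    if not 1 <= j <= m:
        raise ValueError("j must be between 1 and m")
    if not 1 <= copies < m:
        raise ValueError("copies must be between 1 and m - 1")
    return [((j - 1 + step) % m) + 1 for step in range(1, copies + 1)]
```

Because the result depends only on j and M, any component that knows the policy can locate the backed-up record of a faulty storage node without relying on state stored in that node.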
In this embodiment of the present invention, the storage node Nj stores a mapping between a logical address of data of a strip and an identifier of the strip in a cache according to a sequence of receiving the data of the strip. Therefore, the storage node Nj generates the record based on the mapping. In this embodiment of the present invention, a record is not required to be generated immediately after the data of the strip is received, and generating the record may be performed as a background task, so that performance of processing the data of the strip by the storage node Nj is improved. When the storage node is configured to store the data of the check strip, namely, the check data, the storage node Nj stores, in the cache according to a sequence of receiving the data of the check strip, the mapping between the identifier of the strip and the logical address of the data of the check strip stored in the storage node Nj, and the storage node Nj generates the record based on the mapping. In another implementation, the storage node Nj stores the data of the check strip in the cache according to the sequence of receiving the data of the check strip, and the storage node Nj generates the record based on a sequence of data in the cache.
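The deferred generation of the record can be sketched as a queue of mappings kept in arrival order and drained later by a background task. The class StripCache and its method names are hypothetical.

```python
from collections import deque


class StripCache:
    """Hypothetical cache: mappings are queued in the order of receipt."""

    def __init__(self):
        self._pending = deque()  # (logical address, strip identifier) pairs
        self.record = {}         # logical address -> identifiers, oldest first

    def on_strip_received(self, logical_address, strip_id):
        # Cheap operation on the write path: only remember the arrival order.
        self._pending.append((logical_address, strip_id))

    def build_record(self):
        # Intended to run as a background task, off the write path.
        while self._pending:
            logical_address, strip_id = self._pending.popleft()
            self.record.setdefault(logical_address, []).append(strip_id)
        return self.record
```

The write path only appends to the queue, so the ordering information is preserved even though the record itself is generated later.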
In this embodiment of the present invention, the storage node Nj receives a read request sent by the client, and the read request includes a logical address. Because the logical address of the data of the strip SUNj is the same as the logical address of the data of the strip SUKj, the storage node Nj further records a mapping between the logical address of the data of the strip SUNj and the identifier of the strip SUNj, and a mapping between the logical address of the data of the strip SUKj and the identifier of the strip SUKj. Therefore, latest data, namely, data of a latest version, between the data of the strip SUNj and the data of the strip SUKj needs to be determined, to ensure data consistency based on the record. The storage node Nj determines, by querying the stored record based on the logical address, that the data of the strip SUKj is the latest data. The client may be the first client or the second client, or may be a client other than the first client and the second client.
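The read path can be sketched by combining the record with the mapping from strip identifier to stored data. The class StorageNode below is a hypothetical in-memory model; a dictionary stands in for the address at which the data of a strip is stored.

```python
class StorageNode:
    """Hypothetical in-memory model of a storage node Nj."""

    def __init__(self):
        self.order_by_address = {}  # logical address -> strip ids, oldest first
        self.data_by_strip = {}     # strip id -> data (stands in for a disk address)

    def write(self, logical_address, strip_id, payload):
        self.data_by_strip[strip_id] = payload
        self.order_by_address.setdefault(logical_address, []).append(strip_id)

    def read(self, logical_address):
        # Query the record by logical address and return the latest version.
        ids = self.order_by_address.get(logical_address)
        if not ids:
            return None
        return self.data_by_strip[ids[-1]]
```

If the data of SUN1 and then the data of SUK1 are written to the same logical address, a read of that address returns the data of SUK1.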
In this embodiment of the present invention, when the storage node Nj becomes faulty and the data of the strip SUNj and the data of the strip SUKj are lost, the stripe metadata server restores the data of the strip SUNj based on data of other strips of the stripe SN on M−1 storage nodes, and restores the data of the strip SUKj based on data of other strips of the stripe SK on the M−1 storage nodes. For example, when the storage node N1 becomes faulty and the data of the strip SUN1 and the data of the strip SUK1 are lost, the data of the strip SUN1 and the data of the strip SUK1 are restored in the foregoing manner. In another implementation, restoring the data of the strip may be implemented by the client. The storage node Nj obtains the record from another storage node in which the record is backed up, and determines, between the restored data of the strip SUNj and the restored data of the strip SUKj based on the record, that the data of the strip SUKj is latest data, to ensure data consistency based on the record. The storage node Nj queries the backup relationship of the record between the M storage nodes from the stripe metadata server, determines an identifier of a storage node that stores a record of the storage node Nj, and obtains the record of the storage node Nj from the storage node based on the identifier of the storage node.
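After restoration, the latest version can be selected with the record obtained from the backup node. The sketch below assumes that the record is a mapping from logical address to strip identifiers in arrival order; the function name latest_after_restore and the sample values are hypothetical.

```python
def latest_after_restore(restored, backup_record, logical_address):
    """Pick the latest restored strip for a logical address.

    restored: strip identifier -> restored data
    backup_record: logical address -> strip identifiers, oldest first
    """
    for strip_id in reversed(backup_record.get(logical_address, [])):
        if strip_id in restored:
            return strip_id, restored[strip_id]
    return None


restored = {"SUN1": b"old", "SUK1": b"new"}
backup_record = {0x10: ["SUN1", "SUK1"]}
latest = latest_after_restore(restored, backup_record, 0x10)
```

The restored data alone carries no arrival order, so the backed-up record is what allows the data of SUK1 to be identified as the latest data.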
With reference to various implementations of the embodiments of the present invention, an embodiment of the present invention provides a storage node in a distributed storage system as a storage node Nj in the distributed storage system. As shown in
Further, the storage node Nj shown in
Further, the storage node Nj shown in
Further, the storage node Nj shown in
For an implementation of the storage node shown in
An embodiment of the present invention further provides a client, including a plurality of units that are configured to implement an operation of the client in this embodiment of the present invention. For an implementation of the client, refer to the storage node such as the storage node Nj in the embodiments of the present invention. In another implementation, a unit implemented by the client may be a software module that may run on a server to make the storage node complete various implementations described in the embodiments of the present invention.
Correspondingly, the embodiments of the present invention further provide a computer readable storage medium and a computer program product. The computer readable storage medium and the computer program product include computer instructions used to implement various solutions described in the embodiments of the present invention.
Identifiers used to describe the stripe, the data strip, the check strip, and the storage node in the embodiments of the present invention are merely intended to describe the embodiments of the present invention more clearly, and a similar identifier is not required in actual product implementation. Therefore, the identifiers used to describe the stripe, the data strip, the check strip, and the storage node in the embodiments of the present invention are not intended to limit the present invention.
In another implementation of this embodiment of the present invention, a corresponding strip and storage node may not be queried based on a partition.
In the several embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the unit division in the described apparatus embodiment is merely logical function division and may be another division in actual implementation. For example, a plurality of units or components may be combined or may be integrated into another system, or some features may be ignored or not be performed. In addition, the displayed or discussed mutual couplings or direct couplings or communications connections may be implemented by using some interfaces. The indirect couplings or communications connections between the apparatuses or units may be implemented in electrical, mechanical, or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected based on actual requirements to achieve the objectives of the solutions in the embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units are integrated into one unit.
This application is a continuation of International Application No. PCT/CN2018/123349, filed on Dec. 25, 2018, the disclosure of which is hereby incorporated by reference in its entirety.
Relation | Number | Date | Country
---|---|---|---
Parent | PCT/CN2018/123349 | Dec 2018 | US
Child | 17358682 | | US