The present disclosure relates to the field of information technologies, and in particular, to a method and an apparatus for storing data in a distributed block storage system, and a computer readable storage medium.
A distributed block storage system includes a partition, the partition includes storage nodes and stripes, each stripe in the partition includes a plurality of strips, and a storage node in the partition corresponds to a strip in the stripe. That is, a storage node in the partition provides storage space to a strip in the stripe. Usually, as shown in
This application provides a method and an apparatus for storing data in a distributed block storage system, where a primary storage node is not required such that data exchange between storage nodes is reduced, and write performance of a distributed block storage system is improved.
A first aspect of this application provides a method for storing data in a distributed block storage system. The distributed block storage system includes a partition P, the partition P includes M storage nodes Nj and R stripes Si, and each stripe includes strips SUij, where j is every integer from 1 to M, and i is every integer from 1 to R. In the method, a first client receives a first write request, where the first write request includes first data and a logical address. The first client determines that the logical address is located in the partition P and obtains a stripe SN from the R stripes included in the partition P, where N is an integer from 1 to R. The first client then divides the first data to obtain data of one or more strips SUNj in the stripe SN, and sends the data of the one or more strips SUNj to a storage node Nj. Because the client obtains stripes based on a partition, divides data into data of strips of a stripe, and sends the data of the strips to the corresponding storage nodes, no primary storage node is needed, which reduces data exchange between the storage nodes. In addition, the data of the strips of the stripe is concurrently written to the corresponding storage nodes, which improves write performance of the distributed block storage system. Further, a physical address of a strip SUij in each stripe at a storage node Nj may be assigned by a stripe metadata server in advance. The stripe may be a stripe generated based on an erasure coding (EC) algorithm, or may be a stripe generated based on a multi-copy algorithm. When the stripe is a stripe generated based on the EC algorithm, the strips SUij in the stripe include a data strip and a check strip. When the stripe is a stripe generated based on the multi-copy algorithm, all the strips SUij in the stripe are data strips, and the data strips have same data.
Data of a data strip SUNj further includes metadata such as an identifier of the data strip SUNj, and a logical address of the data of the data strip SUNj.
With reference to the first aspect of this application, in a first possible implementation of the first aspect, the first client receives a second write request, where the second write request includes second data and the logical address, that is, the logical address of the first data is the same as the logical address of the second data, the first client determines that the logical address is located in the partition P, and the first client obtains a stripe SY from the R stripes included in the partition P, where Y is an integer from 1 to R, and N is different from Y, the first client divides the second data to obtain data of one or more strips SUYj in the stripe SY, and sends the data of the one or more strips SUYj to a storage node Nj. Data of the data strip SUYj further includes metadata such as an identifier of the data strip SUYj, and a logical address of the data of the data strip SUYj.
With reference to the first aspect of this application, in a second possible implementation of the first aspect, a second client receives a third write request, where the third write request includes third data and the logical address, that is, the logical address of the first data is the same as the logical address of the third data, the second client determines that the logical address is located in the partition P, and the second client obtains a stripe SK from the R stripes included in the partition P, where K is an integer from 1 to R, and N is different from K, the second client divides the third data to obtain data of one or more strips SUKj in the stripe SK, and sends the data of the one or more strips SUKj to a storage node Nj. Data of the data strip SUKj further includes metadata such as an identifier of the data strip SUKj, and a logical address of the data of the data strip SUKj. In the distributed block storage system, the first client and the second client may access the same logical address.
With reference to the first aspect of this application, in a third possible implementation of the first aspect, each piece of the data of the one or more strips SUNj includes at least one of an identifier of the first client and a time stamp TPN at which the first client obtains the stripe SN. A storage node of the distributed block storage system may determine, based on the identifier of the first client in the data of the strip SUNj, that the strip is written by the first client, and the storage node of the distributed block storage system may determine, based on the time stamp TPN at which the first client obtains the stripe SN and that is in the data of the strip SUNj, a sequence in which the first client writes strips.
With reference to the first possible implementation of the first aspect of this application, in a fourth possible implementation of the first aspect, each piece of the data of the one or more strips SUYj includes at least one of an identifier of the first client and a time stamp TPY at which the first client obtains the stripe SY. A storage node of the distributed block storage system may determine, based on the identifier of the first client in the data of the strip SUYj, that the strip is written by the first client, and the storage node of the distributed block storage system may determine, based on the time stamp TPY at which the first client obtains the stripe SY and that is in the data of the strip SUYj, a sequence in which the first client writes strips.
With reference to the second possible implementation of the first aspect of this application, in a fifth possible implementation of the first aspect, each piece of the data of the one or more strips SUKj includes at least one of an identifier of the second client and a time stamp TPK at which the second client obtains the stripe SK. A storage node of the distributed block storage system may determine, based on the identifier of the second client in the data of the strip SUKj, that the strip is written by the second client, and the storage node of the distributed block storage system may determine, based on the time stamp TPK at which the second client obtains the stripe SK and that is in the data of the strip SUKj, a sequence in which the second client writes strips.
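The use of the client identifier and the stripe time stamp described in the third to the fifth possible implementations may be illustrated with a minimal sketch. The record layout, the field names, and the strip identifiers such as "SU_N_1" are assumptions made for illustration only and are not defined by this application.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class StripData:
    strip_id: str    # illustrative strip identifier, e.g. "SU_N_1"
    client_id: str   # identifier of the client that wrote the strip
    stripe_ts: int   # time stamp at which that client obtained the stripe
    payload: bytes


def write_order_per_client(strips: List[StripData]) -> Dict[str, List[str]]:
    """Group strips by writing client, then order each group by time stamp."""
    by_client: Dict[str, List[StripData]] = {}
    for s in strips:
        by_client.setdefault(s.client_id, []).append(s)
    return {
        cid: [s.strip_id for s in sorted(group, key=lambda s: s.stripe_ts)]
        for cid, group in by_client.items()
    }


# Strips as a storage node might have received them, in arrival order.
received = [
    StripData("SU_Y_1", "client-1", 20, b"new"),
    StripData("SU_K_1", "client-2", 15, b"other"),
    StripData("SU_N_1", "client-1", 10, b"old"),
]
order = write_order_per_client(received)
```

In this sketch, time stamps are compared only among strips written by the same client, which matches the description that the time stamp identifies a sequence in which that client writes strips.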
With reference to the first aspect of this application, in a sixth possible implementation of the first aspect, the strip SUij in the stripe Si is assigned by a stripe metadata server from the storage node Nj based on a mapping between the partition P and the storage node Nj included in the partition. The stripe metadata server assigns a physical storage address to the strip SUij in the stripe Si from the storage node Nj in advance, and a waiting time of a client before the client writes data may be reduced, thereby improving write performance of the distributed block storage system.
With reference to any one of the first aspect of this application or the first to the sixth possible implementations of the first aspect, in a seventh possible implementation of the first aspect, each piece of the data of the one or more strips SUNj further includes data strip status information, and the data strip status information is used to identify whether each data strip of the stripe SN is empty. Therefore, all-0 data does not need to be written to the storage node in place of the data of a strip whose data is empty, thereby reducing a data write amount of the distributed block storage system.
A second aspect of this application further provides a method for storing data in a distributed block storage system. The distributed block storage system includes a partition P, the partition P includes M storage nodes Nj and R stripes Si, and each stripe includes strips SUij, where j is every integer from 1 to M, and i is every integer from 1 to R. In the method, a storage node Nj receives data of a strip SUNj in a stripe SN sent by a first client, where the data of the strip SUNj is obtained by dividing first data by the first client, the first data is obtained by receiving a first write request by the first client, the first write request includes the first data and a logical address, and the logical address is used to determine that the first data is located in the partition P. The storage node Nj stores, based on a mapping between an identifier of the strip SUNj and a first physical address of the storage node Nj, the data of SUNj at the first physical address. Because the logical address is an address at which the data written by the client is stored in the distributed block storage system, that the logical address is located in the partition P and that the first data is located in the partition P have a same meaning. The storage node Nj receives only the data of the strip SUNj sent by the client. Therefore, the distributed block storage system does not need a primary storage node, which reduces data exchange between storage nodes, and data of strips of a stripe is concurrently written to the corresponding storage nodes, which improves write performance of the distributed block storage system. Further, a physical address of a strip SUij in each stripe at a storage node Nj may be assigned by a stripe metadata server in advance. Therefore, the first physical address of the strip SUNj at the storage node Nj is also assigned by the stripe metadata server in advance.
The stripe may be a stripe generated based on an EC algorithm, or may be a stripe generated based on a multi-copy algorithm. When the stripe is a stripe generated based on the EC algorithm, the strips SUij in the stripe include a data strip and a check strip. When the stripe is a stripe generated based on the multi-copy algorithm, all the strips SUij in the stripe are data strips, and the data strips have same data. Data of a data strip SUNj further includes metadata such as an identifier of the data strip SUNj, and a logical address of the data of the data strip SUNj.
With reference to the second aspect of this application, in a first possible implementation of the second aspect, the method further includes assigning, by the storage node Nj, a time stamp TPNj to the data of the strip SUNj, where the time stamp TPNj may be used as a reference time stamp at which the data of the strip in the stripe SN is recovered after another storage node is faulty.
With reference to the second aspect of this application or the first possible implementation of the second aspect, in a second possible implementation of the second aspect, the method further includes establishing, by the storage node Nj, a correspondence between a logical address of the data of the strip SUNj and the identifier of the strip SUNj such that the client accesses, using the logical address, the data of the strip SUNj stored in the storage node Nj in the distributed block storage system.
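The storage-node behavior of the second aspect and its first and second possible implementations may be sketched as follows. This is a simplified in-memory model under stated assumptions: the class name, the dictionary-based "disk", the counter used in place of a real clock, and the example addresses are all hypothetical.

```python
import itertools
from typing import Dict


class StorageNode:
    """Toy model of a storage node Nj; all names here are illustrative."""

    def __init__(self, strip_to_phys: Dict[str, int]) -> None:
        # Mapping strip identifier -> physical address, assumed to have been
        # assigned in advance by the stripe metadata server.
        self.strip_to_phys = dict(strip_to_phys)
        self.disk: Dict[int, bytes] = {}        # physical address -> data
        self.strip_ts: Dict[str, int] = {}      # strip id -> node time stamp
        self.lba_to_strip: Dict[int, str] = {}  # logical address -> strip id
        self._clock = itertools.count(1)        # stand-in for a real clock

    def store(self, strip_id: str, logical_address: int, data: bytes) -> int:
        phys = self.strip_to_phys[strip_id]     # look up preassigned address
        self.disk[phys] = data
        # Time stamp assigned by the node, usable later as a reference
        # time stamp during recovery on another node.
        self.strip_ts[strip_id] = next(self._clock)
        # Correspondence between the logical address and the strip identifier.
        self.lba_to_strip[logical_address] = strip_id
        return phys

    def read(self, logical_address: int) -> bytes:
        strip_id = self.lba_to_strip[logical_address]
        return self.disk[self.strip_to_phys[strip_id]]


node = StorageNode({"SU_N_1": 0x1000})
node.store("SU_N_1", logical_address=4096, data=b"first")
```

Because the physical address is looked up rather than allocated at write time, the node in this sketch performs no allocation work on the write path.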
With reference to the second aspect of this application or the first or the second possible implementation of the second aspect, in a third possible implementation of the second aspect, the data of SUNj includes at least one of an identifier of the first client and a time stamp TPN at which the first client obtains the stripe SN. The storage node Nj may determine, based on the identifier of the first client in the data of the strip SUNj, that the strip is written by the first client, and the storage node Nj may determine, based on the time stamp TPN at which the first client obtains the stripe SN and that is in the data of the strip SUNj, a sequence in which the first client writes strips.
With reference to any one of the second aspect of this application or the first to the third possible implementations of the second aspect, in a fourth possible implementation of the second aspect, the method further includes receiving, by the storage node Nj, data of a strip SUYj in a stripe SY sent by the first client, where the data of the strip SUYj is obtained by dividing second data by the first client, the second data is obtained by receiving a second write request by the first client, the second write request includes second data and the logical address, the logical address is used to determine that the second data is located in the partition P, that is, the logical address of the first data is the same as the logical address of the second data, and storing, by the storage node Nj based on a mapping between an identifier of a strip SUYj and a second physical address of the storage node Nj, the data of SUYj at the second physical address. Because the logical address is an address at which the data written by the client is stored in the distributed block storage system, that the logical address is located in the partition P and that the second data is located in the partition P have a same meaning. Data of a data strip SUYj further includes metadata such as an identifier of the data strip SUYj, and a logical address of the data of the data strip SUYj.
With reference to the fourth possible implementation of the second aspect of this application, in a fifth possible implementation of the second aspect, the method further includes assigning, by the storage node Nj, a time stamp TPYj to the data of the strip SUYj. The time stamp TPYj may be used as a reference time stamp at which the data of the strip in the stripe SY is recovered after another storage node is faulty.
With reference to the fourth or the fifth possible implementation of the second aspect of this application, in a sixth possible implementation of the second aspect, the method further includes establishing, by the storage node Nj, a correspondence between a logical address of the data of the strip SUYj and an identifier of the strip SUYj such that the client accesses, using the logical address, the data of the strip SUYj stored in the storage node Nj in the distributed block storage system.
With reference to any one of the fourth to the sixth possible implementations of the second aspect of this application, in a seventh possible implementation of the second aspect, the data of SUYj includes at least one of the identifier of the first client and a time stamp TPY at which the first client obtains the stripe SY. The storage node Nj may determine, based on the identifier of the first client in the data of the strip SUYj, that the strip is written by the first client, and the storage node Nj may determine, based on the time stamp TPY at which the first client obtains the stripe SY and that is in the data of the strip SUYj, a sequence in which the first client writes strips.
With reference to the second aspect of this application or the first or the second possible implementation of the second aspect, in an eighth possible implementation of the second aspect, the method further includes receiving, by the storage node Nj, data of a strip SUKj in a stripe SK sent by a second client, where the data of the strip SUKj is obtained by dividing third data by the second client, the third data is obtained by receiving a third write request by the second client, the third write request includes the third data and the logical address, the logical address is used to determine that the third data is located in the partition P, that is, the logical address of the first data is the same as the logical address of the third data, and storing, by the storage node Nj based on a mapping between an identifier of a strip SUKj and a third physical address of the storage node Nj, the data of SUKj at the third physical address. Because the logical address is an address at which the data written by the client is stored in the distributed block storage system, that the logical address is located in the partition P and that the third data is located in the partition P have a same meaning. In the distributed block storage system, the first client and the second client may access the same logical address. Data of a data strip SUKj further includes metadata such as an identifier of the data strip SUKj, and a logical address of the data of the data strip SUKj.
With reference to the eighth possible implementation of the second aspect, in a ninth possible implementation of the second aspect, the method further includes assigning, by the storage node Nj, a time stamp TPKj to the data of the strip SUKj. The time stamp TPKj may be used as a reference time stamp at which the data of the strip in the stripe SK is recovered after another storage node is faulty.
With reference to the eighth or the ninth possible implementation of the second aspect of this application, in a tenth possible implementation of the second aspect, the method further includes establishing, by the storage node Nj, a correspondence between a logical address of the data of the strip SUKj and an identifier of the strip SUKj such that the client accesses, using the logical address, the data of the strip SUKj stored in the storage node Nj in the distributed block storage system.
With reference to any one of the eighth to the tenth possible implementations of the second aspect of this application, in an eleventh possible implementation of the second aspect, the data of SUKj includes at least one of an identifier of the second client and a time stamp TPK at which the second client obtains the stripe SK. The storage node Nj may determine, based on the identifier of the second client in the data of the strip SUKj, that the strip is written by the second client, and the storage node Nj may determine, based on the time stamp TPK at which the second client obtains the stripe SK and that is in the data of the strip SUKj, a sequence in which the second client writes strips.
With reference to the second aspect of this application, in a twelfth possible implementation of the second aspect, the strip SUij in the stripe Si is assigned by a stripe metadata server from the storage node Nj based on a mapping between the partition P and the storage node Nj included in the partition P. The stripe metadata server assigns a physical storage address to the strip SUij in the stripe Si from the storage node Nj in advance, and a waiting time of a client before the client writes data may be reduced, thereby improving write performance of the distributed block storage system.
With reference to any one of the second aspect of this application or the first to the twelfth possible implementations of the second aspect, in a thirteenth possible implementation of the second aspect, each piece of data of the one or more strips SUNj further includes data strip status information, and the data strip status information is used to identify whether each data strip of the stripe SN is empty. Therefore, all-0 data does not need to be written to the storage node in place of the data of a strip whose data is empty, thereby reducing a data write amount of the distributed block storage system.
With reference to the ninth possible implementation of the second aspect, in a fourteenth possible implementation of the second aspect, after the storage node Nj is faulty, a new storage node recovers the data of the strip SUNj and the data of SUKj based on the stripe SN and the stripe SK respectively, the new storage node obtains a time stamp TPNX of data of a strip SUNX in a storage node NX as a reference time stamp of the data of the strip SUNj, and obtains a time stamp TPKX of data of a strip SUKX in the storage node NX as a reference time stamp of the data of the strip SUKj, and the new storage node eliminates, from a buffer based on the time stamp TPNX and the time stamp TPKX, strip data, corresponding to an earlier time, in the data of the strip SUNj and the data of SUKj, where X is any integer from 1 to M other than j. Latest strip data is reserved in the storage system, thereby saving buffer space.
With reference to the seventh possible implementation of the second aspect, in a fifteenth possible implementation of the second aspect, after the storage node Nj is faulty, a new storage node recovers the data of the strip SUNj and the data of SUYj based on the stripe SN and the stripe SY respectively, where the data of the strip SUNX includes the time stamp TPN, and the data of the strip SUYj includes the time stamp TPY, and the new storage node eliminates, from a buffer based on the time stamp TPN and the time stamp TPY, the earlier one of the data of the strip SUNj and the data of SUYj, where X is any integer from 1 to M other than j. Latest strip data of the same client is reserved in the storage system, thereby saving buffer space.
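The buffer elimination described in the fourteenth and the fifteenth possible implementations may be sketched as follows. The function name and the dictionary representation of the buffer are assumptions, and the sketch assumes that the recovered strips hold data for the same logical address and that their reference time stamps are comparable, for example because both are obtained from the same storage node NX.

```python
from typing import Dict


def keep_latest(
    recovered: Dict[str, bytes],
    reference_ts: Dict[str, int],
) -> Dict[str, bytes]:
    """Drop recovered strip data with an earlier reference time stamp.

    recovered:    strip id -> rebuilt strip data held in the buffer
    reference_ts: strip id -> reference time stamp for that strip, such as
                  TPNX for SUNj and TPKX for SUKj
    """
    latest = max(recovered, key=lambda strip_id: reference_ts[strip_id])
    return {latest: recovered[latest]}


buffer = {"SU_N_j": b"older data", "SU_K_j": b"newer data"}
reference = {"SU_N_j": 7, "SU_K_j": 12}
buffer = keep_latest(buffer, reference)
```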
With reference to the distributed block storage system in any one of the first aspect of this application or the first to the seventh possible implementations of the first aspect, a third aspect of this application further provides an apparatus for writing data in a distributed block storage system. The apparatus for writing data includes a plurality of units configured to perform any one of the first aspect of this application or the first to the seventh possible implementations of the first aspect.
With reference to the distributed block storage system in any one of the second aspect of this application or the first to the fifteenth possible implementations of the second aspect, a fourth aspect of this application further provides an apparatus for storing data in a distributed block storage system. The apparatus for storing data includes a plurality of units configured to perform any one of the second aspect of this application or the first to the fifteenth possible implementations of the second aspect.
A fifth aspect of this application further provides the distributed block storage system in any one of the second aspect of this application or the first to the fifteenth possible implementations of the second aspect. A storage node Nj in the distributed block storage system is configured to perform any one of the second aspect of this application or the first to the fifteenth possible implementations of the second aspect.
A sixth aspect of this application further provides a client, applied to the distributed block storage system in any one of the first aspect of this application or the first to the seventh possible implementations of the first aspect. The client includes a processor and an interface, the processor communicates with the interface, and the processor is configured to perform any one of the first aspect of this application or the first to the seventh possible implementations of the first aspect.
A seventh aspect of this application further provides a storage node, applied to the distributed block storage system in any one of the second aspect of this application or the first to the fifteenth possible implementations of the second aspect. The storage node used as a storage node Nj includes a processor and an interface, the processor communicates with the interface, and the processor is configured to perform any one of the second aspect of this application or the first to the fifteenth possible implementations of the second aspect.
An eighth aspect of this application further provides a computer readable storage medium, applied to the distributed block storage system in any one of the first aspect of this application or the first to the seventh possible implementations of the first aspect. The computer readable storage medium includes a computer instruction, used to enable a client to perform any one of the first aspect of this application or the first to the seventh possible implementations of the first aspect.
A ninth aspect of this application further provides a computer readable storage medium, applied to the distributed block storage system in any one of the second aspect of this application or the first to the fifteenth possible implementations of the second aspect. The computer readable storage medium includes a computer instruction, used to enable a storage node to perform any one of the second aspect of this application or the first to the fifteenth possible implementations of the second aspect.
A tenth aspect of this application further provides a computer program product, applied to the distributed block storage system in any one of the first aspect of this application or the first to the seventh possible implementations of the first aspect. The computer program product includes a computer instruction, used to enable a client to perform any one of the first aspect of this application or the first to the seventh possible implementations of the first aspect.
An eleventh aspect of this application further provides a computer program product, applied to the distributed block storage system in any one of the second aspect of this application or the first to the fifteenth possible implementations of the second aspect. The computer program product includes a computer instruction, used to enable a storage node to perform any one of the second aspect of this application or the first to the fifteenth possible implementations of the second aspect.
A distributed block storage system in the embodiments of the present disclosure is, for example, Huawei® Fusionstorage® series. For example, as shown in
A server of the distributed block storage system includes a structure shown in
In the structure shown in
Based on a reliability requirement of the distributed block storage system, data reliability can be improved using an EC algorithm, for example, in a 3+1 mode, that is, a stripe includes three data strips and one check strip. In the embodiments of the present disclosure, a partition stores data in a stripe form, and one partition includes R stripes Si, where i is every integer from 1 to R. In the embodiments of the present disclosure, P2 is used as an example for description.
The distributed block storage system performs fragment management on a hard disk using 4 kilobytes (KB) as a unit, and records assignment information of each fragment of 4 KB in a metadata management area of the hard disk, and a storage resource pool includes fragments of the hard disk. The distributed block storage system includes a stripe metadata server, and in a specific implementation, a stripe metadata management program may be run on one or more servers in the distributed block storage system. The stripe metadata server assigns a stripe to a partition. Still using the partition view shown in
In the embodiments of the present disclosure, a logical unit assigned by the distributed block storage system is mounted to the client, thereby performing a data access operation. The logical unit is also referred to as a logical unit number (LUN). In the distributed block storage system, one logical unit may be mounted to only one client, or one logical unit may be mounted to a plurality of clients, that is, a plurality of clients share one logical unit. The logical unit is provided by the storage resource pool shown in
In an embodiment of the present disclosure, as shown in
Step 601: The first client receives a first write request, where the first write request includes first data and a logical address.
In a distributed block storage system, the first client may be a VM or a server. An application program is run on the first client, and the application program accesses a logical unit mounted to the first client, for example, sends the first write request to the logical unit. The first write request includes the first data and the logical address, and the logical address is also referred to as a logical block address (LBA). The logical address is used to indicate a write location of the first data in the logical unit.
Step 602: The first client determines that the logical address is located in a partition P.
In this embodiment of the present disclosure, a partition P2 is used as an example. With reference to
Step 603: The first client obtains a stripe SN from R stripes, where N is an integer from 1 to R.
A stripe metadata server manages a correspondence between a partition and a stripe, and a relationship between a strip in a stripe and a storage node. In an implementation in which the first client obtains a stripe SN from R stripes, the first client determines that the logical address is located in the partition P2, and the first client queries the stripe metadata server to obtain a stripe SN of the R stripes included in the partition P2. Because the logical address is an address at which the data written by the client is stored in the distributed block storage system, that the logical address is located in the partition P and that the first data is located in the partition P have a same meaning. In another implementation in which the first client obtains a stripe SN from R stripes, the first client may obtain a stripe SN from stripes that are assigned to the first client and that are of the R stripes.
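The two implementations of step 603 may be sketched as follows. The classes, the method names, and the prefetch parameter are hypothetical stand-ins. The sketch only contrasts querying the stripe metadata server on each write with drawing from stripes already assigned to the client.

```python
from collections import deque
from typing import Deque, Dict, List, Optional


class StripeMetadataServer:
    """Hypothetical in-memory stand-in for the stripe metadata server."""

    def __init__(self, stripes_per_partition: Dict[str, List[str]]) -> None:
        self._free: Dict[str, Deque[str]] = {
            p: deque(s) for p, s in stripes_per_partition.items()
        }

    def allocate(self, partition: str, count: int = 1) -> List[str]:
        free = self._free[partition]
        return [free.popleft() for _ in range(min(count, len(free)))]


class Client:
    def __init__(self, mds: StripeMetadataServer, prefetch: int = 0) -> None:
        self.mds = mds
        self.prefetch = prefetch
        self._assigned: Dict[str, Deque[str]] = {}

    def obtain_stripe(self, partition: str) -> Optional[str]:
        local = self._assigned.setdefault(partition, deque())
        if not local:
            # Query the metadata server. With prefetch > 0, a batch of
            # stripes is assigned to this client so that later writes can
            # take a stripe locally without another query.
            local.extend(self.mds.allocate(partition, max(1, self.prefetch)))
        return local.popleft() if local else None


mds = StripeMetadataServer({"P2": ["S1", "S2", "S3", "S4"]})
client1 = Client(mds, prefetch=2)
client2 = Client(mds)
first = client1.obtain_stripe("P2")   # queried; S2 is also assigned to client1
third = client2.obtain_stripe("P2")   # queried by a different client
second = client1.obtain_stripe("P2")  # taken from client1's assigned stripes
```

In this sketch no stripe is handed to two clients, so writes by different clients to the same logical address land in different stripes.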
Step 604: The first client divides the first data into data of one or more strips SUNj in the stripe SN.
The stripe SN includes strips. The first client receives the first write request, buffers the first data included in the first write request, and divides the buffered data based on a size of a strip in the stripe. For example, the first client divides the data based on a length of the strip in the stripe to obtain strip-size data, and performs a modulo operation on a logical address of the strip-size data using a quantity M (such as four) of storage nodes in the partition, thereby determining a location of the strip-size data in the stripe, that is, a corresponding strip SUNj. The first client then determines, based on the partition view, a storage node Nj corresponding to the strip SUNj such that data of strips having a same logical address is located in a same storage node. For example, the first data is divided into data of one or more strips SUNj. In this embodiment of the present disclosure, P2 is used as an example. With reference to
In this embodiment of the present disclosure, the stripe SN includes four strips, that is, three data strips and one check strip. When the first client has buffered data for a period of time and needs to write the data to storage nodes, but the buffered data cannot fill all the data strips, for example, when only the data of the strip SUN1 and the data of SUN2 are obtained by dividing the first data, the check strip is generated based on the data of SUN1 and the data of SUN2. Data of a valid data strip SUNj includes data strip status information of the stripe SN, and the valid data strip SUNj is a strip whose data is not empty. In this embodiment of the present disclosure, both the data of the valid data strip SUN1 and the data of SUN2 include the data strip status information of the stripe SN, and the data strip status information is used to identify whether each data strip of the stripe SN is empty. For example, if 1 is used to indicate that a data strip is not empty, and 0 is used to indicate that a data strip is empty, the data strip status information included in the data of SUN1 is 110, and the data strip status information included in the data of SUN2 is 110, indicating that SUN1 is not empty, SUN2 is not empty, and SUN3 is empty. The data of the check strip SUN4 generated based on the data of SUN1 and the data of SUN2 includes check data of the data strip status information. Because SUN3 is empty, the first client does not need to replace the data of SUN3 with all-0 data and write the all-0 data to a storage node N3, thereby reducing a data write amount. When reading the stripe SN, the first client determines, based on the data strip status information of the stripe SN included in the data of the data strip SUN1 or the data of SUN2, that the data of SUN3 is empty.
When SUN3 is not empty, the data strip status information included in the data of SUN1, the data of SUN2, and the data of SUN3 in this embodiment of the present disclosure is 111, and the data of the check strip SUN4 generated based on the data of SUN1, the data of SUN2, and the data of SUN3 includes check data of the data strip status information.
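As a non-limiting illustration, the data strip status information described above may be sketched as follows. The function names, the dictionary layout, and the use of a character string for the status information are assumptions made only for this sketch and are not part of the embodiment.

```python
from typing import Dict, List, Optional


def strip_status_bits(data_strips: List[Optional[bytes]]) -> str:
    """Return one character per data strip: '1' if not empty, '0' if empty."""
    return "".join("1" if strip else "0" for strip in data_strips)


def valid_strips_with_status(data_strips: List[Optional[bytes]]) -> Dict[int, dict]:
    """Attach the stripe-wide status string to every non-empty data strip.

    Empty strips are omitted, so nothing is written for them.
    Keys are the 1-based strip positions j.
    """
    status = strip_status_bits(data_strips)
    return {
        index + 1: {"status": status, "payload": payload}
        for index, payload in enumerate(data_strips)
        if payload
    }


# SUN1 and SUN2 hold data; SUN3 is empty and is never written.
stripe = [b"data-1", b"data-2", None]
records = valid_strips_with_status(stripe)
```

In this sketch a reader that obtains either record can tell from the status string alone that the third data strip is empty, without contacting the storage node N3.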
Further, in this embodiment of the present disclosure, the data of the data strip SUNj further includes at least one of an identifier of the first client and a time stamp TPN at which the first client obtains the stripe SN, that is, either one of them or both. When data of a check strip SUNj is generated based on the data of the data strip SUNj, the data of the check strip SUNj also includes check data of the at least one of the identifier of the first client and the time stamp TPN.
In this embodiment of the present disclosure, the data of the data strip SUNj further includes metadata such as an identifier of the data strip SUNj, and a logical address of the data of the data strip SUNj.
Step 605: The first client sends the data of the one or more strips SUNj to a storage node Nj.
In this embodiment of the present disclosure, the first client sends the data of SUN1 obtained by dividing the first data to the storage node N1, and sends the data of SUN2 obtained by dividing the first data to the storage node N2. The first client may concurrently send the data of the strip SUNj of the stripe SN to the storage node Nj without needing a primary storage node in order to reduce data exchange between the storage nodes, and improve write concurrency, thereby improving write performance of the distributed block storage system.
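As a non-limiting illustration, the division and concurrent sending performed by the first client may be sketched as follows. The strip size, the exact form of the modulo rule, and the function `send_to_node` (a stand-in for a network write) are assumptions made only for this sketch. Generation of the check strip is omitted, and the logical address is assumed to be aligned to the strip size.

```python
from concurrent.futures import ThreadPoolExecutor

STRIP_SIZE = 4  # bytes per strip; hypothetical and deliberately tiny
M = 4           # quantity of storage nodes in the partition


def divide(logical_address, data):
    """Split buffered data into strip-size pieces tagged with node index j (1..M)."""
    pieces = []
    for offset in range(0, len(data), STRIP_SIZE):
        address = logical_address + offset
        # Assumed form of the modulo rule: strip number modulo M.
        j = (address // STRIP_SIZE) % M + 1
        pieces.append((j, address, data[offset:offset + STRIP_SIZE]))
    return pieces


def send_to_node(piece):
    """Stand-in for sending the data of one strip to storage node Nj."""
    j, address, payload = piece
    return j, address, len(payload)


def write(logical_address, data):
    """Divide the data and send all strips concurrently, with no primary node."""
    pieces = divide(logical_address, data)
    with ThreadPoolExecutor(max_workers=M) as pool:
        return list(pool.map(send_to_node, pieces))


acks = write(0, b"AAAABBBB")
```

Because the node index depends only on the logical address, a later write to the same logical address is routed to the same storage node in this sketch.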
Further, if a logical unit is mounted to only the first client, the first client receives a second write request, where the second write request includes second data and the logical address that is described in
Further, if a logical unit is mounted to a plurality of clients, for example, mounted to the first client and a second client, the second client receives a third write request, where the third write request includes third data and the logical address that is described in
In other approaches, a client needs to first send data to a primary storage node, and the primary storage node divides the data into data of strips, and sends data of strips other than a strip stored in the primary storage node to corresponding storage nodes. As a result, the primary storage node becomes a data storage bottleneck in a distributed block storage system, and data exchange between the storage nodes is increased. However, in the embodiment shown in
Corresponding to the embodiment of the first client shown in
Step 801: The storage node Nj receives data of a strip SUNj in a stripe SN sent by a first client.
With reference to the embodiment shown in
Step 802: The storage node Nj stores, based on a mapping between an identifier of the strip SUNj and a first physical address of the storage node Nj, the data of SUNj at the first physical address.
A stripe metadata server assigns, in the storage node Nj, the first physical address to the strip SUNj of the stripe SN in a partition in advance based on a partition view, metadata of the storage node Nj stores the mapping between the identifier of the strip SUNj and the first physical address of the storage node Nj, and the storage node Nj receives the data of the strip SUNj, and stores the data of the strip SUNj at the first physical address based on the mapping. For example, the storage node N1 receives the data of SUN1 sent by the first client, and stores the data of SUN1 at the first physical address of N1, and the storage node N2 receives the data of SUN2 sent by the first client, and stores the data of SUN2 at the first physical address of N2.
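As a non-limiting illustration, the advance assignment of physical addresses and the storing performed in step 802 may be sketched as follows. The strip identifier format `(i, j)`, the address formula, and the names `preassign` and `StorageNode` are assumptions made only for this sketch. The embodiment does not specify how the stripe metadata server chooses addresses.

```python
def preassign(node_count, stripe_count, strip_size):
    """Build, per storage node j, a mapping {(i, j): physical address}.

    Plays the role of the stripe metadata server; the formula below is
    a hypothetical layout in which stripe i occupies the i-th slot.
    """
    return {
        j: {(i, j): (i - 1) * strip_size for i in range(1, stripe_count + 1)}
        for j in range(1, node_count + 1)
    }


class StorageNode:
    def __init__(self, mapping):
        self.mapping = mapping  # strip identifier -> physical address
        self.medium = {}        # physical address -> bytes (stand-in for a disk)

    def store(self, strip_id, data):
        """Store strip data at the address assigned to the strip in advance."""
        address = self.mapping[strip_id]
        self.medium[address] = data
        return address


assignments = preassign(node_count=4, stripe_count=3, strip_size=4096)
n1 = StorageNode(assignments[1])
n2 = StorageNode(assignments[2])
addr1 = n1.store((2, 1), b"data of SU21")
addr2 = n2.store((2, 2), b"data of SU22")
```

In this sketch neither storage node communicates with the other: each node resolves the address of its own strip from its own metadata.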
In the other approaches, a primary storage node receives data sent by a client, divides the data into data of data strips in a stripe, forms data of a check strip based on the data of the data strips, and sends data of strips stored in other storage nodes to the corresponding storage nodes. However, in this embodiment of the present disclosure, the storage node Nj receives only the data of the strip SUNj sent by the client, and no primary storage node is needed. This reduces data exchange between storage nodes, and data of strips is concurrently written to the corresponding storage nodes, thereby improving write performance of a distributed block storage system.
Further, the data of the strip SUNj is obtained by dividing first data, a first write request includes a logical address of the first data, and the data of the strip SUNj, as a part of the first data, also has a corresponding logical address. Therefore, the storage node Nj establishes a mapping between the logical address of the data of the strip SUNj and the identifier of the strip SUNj. In this way, the first client still accesses the data of the strip SUNj using the logical address. For example, when the first client accesses the data of the strip SUNj, the first client performs a modulo operation on a quantity M (such as four) of storage nodes in a partition P using the logical address of the data of the strip SUNj, determines that the strip SUNj is located in the storage node Nj, and sends a read request carrying the logical address of the data of the strip SUNj to the storage node Nj. The storage node Nj obtains the identifier of the strip SUNj based on the mapping between the logical address of the data of the strip SUNj and the identifier of the strip SUNj, and then obtains the data of the strip SUNj based on the mapping between the identifier of the strip SUNj and the first physical address of the storage node Nj.
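As a non-limiting illustration, the read path described above may be sketched as follows: the client locates the storage node by a modulo operation, and the storage node resolves the logical address to a strip identifier and then to a physical address. The strip size, the form of the modulo rule, the strip identifier format, and all names are assumptions made only for this sketch.

```python
M = 4              # quantity of storage nodes in the partition
STRIP_SIZE = 4096  # hypothetical strip length in bytes


class StorageNode:
    def __init__(self, strip_to_physical):
        self.strip_to_physical = dict(strip_to_physical)  # assigned in advance
        self.logical_to_strip = {}                        # built on write
        self.medium = {}                                  # physical address -> bytes

    def write(self, strip_id, logical_address, data):
        self.logical_to_strip[logical_address] = strip_id
        self.medium[self.strip_to_physical[strip_id]] = data

    def read(self, logical_address):
        # First mapping: logical address -> strip identifier.
        strip_id = self.logical_to_strip[logical_address]
        # Second mapping: strip identifier -> physical address.
        return self.medium[self.strip_to_physical[strip_id]]


def node_index(logical_address):
    """Assumed form of the modulo rule used by the client."""
    return (logical_address // STRIP_SIZE) % M + 1


# One stripe "N"; strip ("N", j) lives on node j at physical address 0.
nodes = {j: StorageNode({("N", j): 0}) for j in range(1, M + 1)}


def client_write(logical_address, data):
    j = node_index(logical_address)
    nodes[j].write(("N", j), logical_address, data)


def client_read(logical_address):
    return nodes[node_index(logical_address)].read(logical_address)


client_write(4096, b"payload")
result = client_read(4096)
```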
With reference to the embodiment shown in
With reference to the embodiment shown in
Further, the storage node Nj assigns a time stamp TPNj to the data of the strip SUNj, a time stamp TPKj to the data of the strip SUKj, and a time stamp TPYj to the data of the strip SUYj. Based on the time stamps, the storage node Nj may discard, from strip data in a buffer that has a same logical address, the strip data corresponding to an earlier time and retain only the latest strip data, thereby saving buffer space.
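As a non-limiting illustration, the elimination of earlier strip data having a same logical address may be sketched as follows. The buffer layout (a list of tuples) and the use of integer time stamps are assumptions made only for this sketch.

```python
def keep_latest(buffer):
    """Keep, for each logical address, only the entry with the largest time stamp.

    buffer: iterable of (logical_address, time_stamp, data) tuples, where the
    time stamp is the one assigned by the storage node to the strip data.
    Returns {logical_address: (time_stamp, data)}.
    """
    latest = {}
    for logical_address, time_stamp, data in buffer:
        current = latest.get(logical_address)
        if current is None or time_stamp > current[0]:
            latest[logical_address] = (time_stamp, data)
    return latest


buffered = [
    (0, 5, b"older strip data"),
    (0, 9, b"newer strip data"),
    (4096, 7, b"only strip data"),
]
retained = keep_latest(buffered)
```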
With reference to the embodiment shown in
With reference to the embodiment shown in
A time stamp assigned by the storage node Nj may be from a time stamp server, or may be generated by the storage node Nj.
Further, an identifier of the first client included in the data of the data strip SUNj, a time stamp at which the first client obtains the stripe SN, an identifier of the data strip SUNj, a logical address of the data of the data strip SUNj, and data strip status information may be stored at an extension address of a physical address assigned by the storage node Nj to the data strip SUNj, thereby avoiding use of a physical address of the storage node Nj. The extension address of the physical address is a physical address that lies beyond the valid physical address capacity of the storage node Nj and is therefore invisible. When receiving a read request for accessing the physical address, the storage node Nj reads the data at the extension address of the physical address by default. An identifier of the second client included in the data of the data strip SUKj, a time stamp at which the second client obtains the stripe SK, an identifier of the data strip SUKj, a logical address of the data of the data strip SUKj, and data strip status information may also be stored at an extension address of a physical address assigned by the storage node Nj to the data strip SUKj. Likewise, an identifier of the first client included in the data of the data strip SUYj, a time stamp at which the first client obtains the stripe SY, an identifier of the data strip SUYj, a logical address of the data of the data strip SUYj, and data strip status information may also be stored at an extension address of a physical address assigned by the storage node Nj to the data strip SUYj.
Further, the time stamp TPNj assigned by the storage node Nj to the data of the strip SUNj may also be stored at the extension address of the physical address assigned by the storage node Nj to the data strip SUNj. The time stamp TPKj assigned by the storage node Nj to the data of the strip SUKj may also be stored at the extension address of the physical address assigned by the storage node Nj to the data strip SUKj. The time stamp TPYj assigned by the storage node Nj to the data of the strip SUYj may also be stored at the extension address of the physical address assigned by the storage node Nj to the data strip SUYj.
With reference to various implementations of the embodiments of the present disclosure, an embodiment of the present disclosure provides an apparatus 11 for writing data, applied to a distributed block storage system in the embodiments of the present disclosure. As shown in
With reference to various implementations of the embodiments of the present disclosure, an embodiment of the present disclosure provides an apparatus 12 for storing data, applied to a distributed block storage system in the embodiments of the present disclosure. As shown in
With reference to
Further, with reference to
Further, with reference to
Further, with reference to
Further, with reference to
For an implementation of the apparatus 12 for storing data in this embodiment of the present disclosure, refer to a storage node in the embodiments of the present disclosure, such as a storage node Nj. Further, the apparatus 12 for storing data may be a software module, and may be run on a server such that the storage node completes various implementations described in the embodiments of the present disclosure. Alternatively, the apparatus 12 for storing data may be a hardware device. For details, refer to the structure shown in
In this embodiment of the present disclosure, in addition to the stripe generated based on an EC algorithm described above, the stripe may alternatively be a stripe generated based on a multi-copy algorithm. When the stripe is a stripe generated based on the EC algorithm, the strips SUij in the stripe include a data strip and a check strip. When the stripe is a stripe generated based on the multi-copy algorithm, all the strips SUij in the stripe are data strips, and the strips SUij hold the same data.
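As a non-limiting illustration, the two kinds of stripe may be contrasted as follows. The sketch uses a byte-wise XOR as the check computation, which is only the simplest single-check-strip code and is an assumption of this sketch; the embodiment does not specify a particular EC algorithm. All function names are hypothetical, and all strips are assumed to have the same length.

```python
def xor_strips(strips):
    """Byte-wise XOR of equally sized strips."""
    result = bytearray(len(strips[0]))
    for strip in strips:
        for k, byte in enumerate(strip):
            result[k] ^= byte
    return bytes(result)


def ec_stripe(data_strips):
    """EC-style stripe: the data strips followed by one check strip."""
    return list(data_strips) + [xor_strips(data_strips)]


def multi_copy_stripe(data, copies):
    """Multi-copy stripe: every strip is a data strip holding the same data."""
    return [data] * copies


def recover_missing(surviving_data_strips, check_strip):
    """Rebuild the single missing data strip from the survivors and the check strip."""
    return xor_strips(list(surviving_data_strips) + [check_strip])


su1, su2, su3 = b"\x01\x02", b"\x10\x20", b"\x0f\x0f"
stripe = ec_stripe([su1, su2, su3])
rebuilt = recover_missing([su1, su3], stripe[3])
copies = multi_copy_stripe(b"same", 3)
```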
Correspondingly, an embodiment of the present disclosure further provides a computer readable storage medium and a computer program product, and the computer readable storage medium and the computer program product include a computer instruction used to implement various solutions described in the embodiments of the present disclosure.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the unit division in the described apparatus embodiment is merely logical function division and may be other division in actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented using some interfaces. The indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions in the embodiments.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units are integrated into one unit.
This application is a continuation of International Patent Application No. PCT/CN2017/106147 filed on Oct. 13, 2017, which is hereby incorporated by reference in its entirety.
Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/CN2017/106147 | Oct 2017 | US |
Child | 16172264 | | US |