Method and Apparatus for Storing Data in Distributed Block Storage System, and Computer Readable Storage Medium

Information

  • Patent Application
  • Publication Number
    20190114076
  • Date Filed
    October 26, 2018
  • Date Published
    April 18, 2019
Abstract
A method for storing data in a distributed block storage system, where a client generates the data of a stripe and concurrently sends the data of the strips in the stripe to the storage nodes corresponding to the strips, in order to reduce data exchange between the storage nodes and improve write concurrency, thereby improving the write performance of the distributed block storage system.
Description
TECHNICAL FIELD

The present disclosure relates to the field of information technologies, and in particular, to a method and an apparatus for storing data in a distributed block storage system, and a computer readable storage medium.


BACKGROUND

A distributed block storage system includes a partition, the partition includes storage nodes and stripes, each stripe in the partition includes a plurality of strips, and each storage node in the partition corresponds to a strip in a stripe. That is, a storage node in the partition provides storage space for a strip in the stripe. Usually, as shown in FIG. 1, a partition includes a primary storage node (a storage node 1), and the primary storage node is configured to receive data sent by a client. The primary storage node then selects a stripe, divides the data into data of strips, and sends the data of the strips stored on the other storage nodes to the corresponding storage nodes (a storage node 2, a storage node 3, and a storage node 4). The foregoing operation makes the primary storage node easily become a data write bottleneck, increases data exchange between the storage nodes, and degrades the write performance of the distributed block storage system.


SUMMARY

This application provides a method and an apparatus for storing data in a distributed block storage system in which a primary storage node is not required, such that data exchange between storage nodes is reduced and the write performance of the distributed block storage system is improved.


A first aspect of this application provides a method for storing data in a distributed block storage system. The distributed block storage system includes a partition P, the partition P includes M storage nodes Nj and R stripes Si, and each stripe includes strips SUij, where j is every integer from 1 to M, and i is every integer from 1 to R. In the method, a first client receives a first write request, where the first write request includes first data and a logical address. The first client determines that the logical address is located in the partition P, and obtains a stripe SN from the R stripes included in the partition P, where N is an integer from 1 to R. The first client then divides the first data to obtain data of one or more strips SUNj in the stripe SN, and sends the data of the one or more strips SUNj to the corresponding storage node Nj. The client obtains stripes based on a partition, divides data into data of strips of a stripe, and sends the data of the strips to the corresponding storage nodes without needing a primary storage node, in order to reduce data exchange between the storage nodes; the data of the strips of the stripe is concurrently written to the corresponding storage nodes, in order to improve the write performance of the distributed block storage system. Further, a physical address of a strip SUij in each stripe at a storage node Nj may be assigned by a stripe metadata server in advance. The stripe may be a stripe generated based on an erasure coding (EC) algorithm, or may be a stripe generated based on a multi-copy algorithm. When the stripe is generated based on the EC algorithm, the strips SUij in the stripe include data strips and a check strip. When the stripe is generated based on the multi-copy algorithm, all the strips SUij in the stripe are data strips, and the data strips have the same data. Data of a data strip SUNj further includes metadata such as an identifier of the data strip SUNj and a logical address of the data of the data strip SUNj.


With reference to the first aspect of this application, in a first possible implementation of the first aspect, the first client receives a second write request, where the second write request includes second data and the logical address, that is, the logical address of the first data is the same as the logical address of the second data. The first client determines that the logical address is located in the partition P, and obtains a stripe SY from the R stripes included in the partition P, where Y is an integer from 1 to R, and N is different from Y. The first client divides the second data to obtain data of one or more strips SUYj in the stripe SY, and sends the data of the one or more strips SUYj to the corresponding storage node Nj. Data of the data strip SUYj further includes metadata such as an identifier of the data strip SUYj and a logical address of the data of the data strip SUYj.


With reference to the first aspect of this application, in a second possible implementation of the first aspect, a second client receives a third write request, where the third write request includes third data and the logical address, that is, the logical address of the first data is the same as the logical address of the third data. The second client determines that the logical address is located in the partition P, and obtains a stripe SK from the R stripes included in the partition P, where K is an integer from 1 to R, and N is different from K. The second client divides the third data to obtain data of one or more strips SUKj in the stripe SK, and sends the data of the one or more strips SUKj to the corresponding storage node Nj. Data of the data strip SUKj further includes metadata such as an identifier of the data strip SUKj and a logical address of the data of the data strip SUKj. In the distributed block storage system, the first client and the second client may access the same logical address.


With reference to the first aspect of this application, in a third possible implementation of the first aspect, each piece of the data of the one or more strips SUNj includes at least one of an identifier of the first client and a time stamp TPN at which the first client obtains the stripe SN. A storage node of the distributed block storage system may determine, based on the identifier of the first client in the data of the strip SUNj, that the strip is written by the first client, and the storage node of the distributed block storage system may determine, based on the time stamp TPN at which the first client obtains the stripe SN and that is in the data of the strip SUNj, a sequence in which the first client writes strips.


With reference to the first possible implementation of the first aspect of this application, in a fourth possible implementation of the first aspect, each piece of the data of the one or more strips SUYj includes at least one of an identifier of the first client and a time stamp TPY at which the first client obtains the stripe SY. A storage node of the distributed block storage system may determine, based on the identifier of the first client in the data of the strip SUYj, that the strip is written by the first client, and the storage node of the distributed block storage system may determine, based on the time stamp TPY at which the first client obtains the stripe SY and that is in the data of the strip SUYj, a sequence in which the first client writes strips.


With reference to the second possible implementation of the first aspect of this application, in a fifth possible implementation of the first aspect, each piece of the data of the one or more strips SUKj includes at least one of an identifier of the second client and a time stamp TPK at which the second client obtains the stripe SK. A storage node of the distributed block storage system may determine, based on the identifier of the second client in the data of the strip SUKj, that the strip is written by the second client, and the storage node of the distributed block storage system may determine, based on the time stamp TPK at which the second client obtains the stripe SK and that is in the data of the strip SUKj, a sequence in which the second client writes strips.


With reference to the first aspect of this application, in a sixth possible implementation of the first aspect, the strip SUij in the stripe Si is assigned by a stripe metadata server from the storage node Nj based on a mapping between the partition P and the storage nodes Nj included in the partition P. Because the stripe metadata server assigns a physical storage address to the strip SUij in the stripe Si from the storage node Nj in advance, the waiting time of a client before it writes data may be reduced, thereby improving the write performance of the distributed block storage system.


With reference to any one of the first aspect of this application or the first to the sixth possible implementations of the first aspect, in a seventh possible implementation of the first aspect, each piece of the data of the one or more strips SUNj further includes data strip status information, and the data strip status information is used to identify whether each data strip of the stripe SN is empty, such that all-0 data does not need to be written to the storage node in place of the data of a strip whose data is empty, thereby reducing the data write amount of the distributed block storage system.


A second aspect of this application further provides a method for storing data in a distributed block storage system. The distributed block storage system includes a partition P, the partition P includes M storage nodes Nj and R stripes Si, and each stripe includes strips SUij, where j is every integer from 1 to M, and i is every integer from 1 to R. In the method, a storage node Nj receives data of a strip SUNj in a stripe SN sent by a first client, where the data of the strip SUNj is obtained by dividing first data by the first client, the first data is obtained by receiving a first write request by the first client, the first write request includes the first data and a logical address, and the logical address is used to determine that the first data is located in the partition P. The storage node Nj stores, based on a mapping between an identifier of the strip SUNj and a first physical address of the storage node Nj, the data of SUNj at the first physical address. Because the logical address is an address at which the data written by the client is stored in the distributed block storage system, that the logical address is located in the partition P and that the first data is located in the partition P have the same meaning. The storage node Nj receives only the data of the strip SUNj sent by the client. Therefore, the distributed block storage system does not need a primary storage node, in order to reduce data exchange between storage nodes, and the data of the strips of a stripe is concurrently written to the corresponding storage nodes, in order to improve the write performance of the distributed block storage system. Further, a physical address of a strip SUij in each stripe at a storage node Nj may be assigned by a stripe metadata server in advance. Therefore, the first physical address of the strip SUNj at the storage node Nj is also assigned by the stripe metadata server in advance. The stripe may be a stripe generated based on an EC algorithm, or may be a stripe generated based on a multi-copy algorithm. When the stripe is generated based on the EC algorithm, the strips SUij in the stripe include data strips and a check strip. When the stripe is generated based on the multi-copy algorithm, all the strips SUij in the stripe are data strips, and the data strips have the same data. Data of a data strip SUNj further includes metadata such as an identifier of the data strip SUNj and a logical address of the data of the data strip SUNj.


With reference to the second aspect of this application, in a first possible implementation of the second aspect, the method further includes assigning, by the storage node Nj, a time stamp TPNj to the data of the strip SUNj, where the time stamp TPNj may be used as a reference time stamp based on which the data of the strip in the stripe SN is recovered after another storage node becomes faulty.


With reference to the second aspect of this application or the first possible implementation of the second aspect, in a second possible implementation of the second aspect, the method further includes establishing, by the storage node Nj, a correspondence between a logical address of the data of the strip SUNj and the identifier of the strip SUNj such that the client accesses, using the logical address, the data of the strip SUNj stored in the storage node Nj in the distributed block storage system.


With reference to the second aspect of this application or the first or the second possible implementation of the second aspect, in a third possible implementation of the second aspect, the data of SUNj includes at least one of an identifier of the first client and a time stamp TPN at which the first client obtains the stripe SN. The storage node Nj may determine, based on the identifier of the first client in the data of the strip SUNj, that the strip is written by the first client, and the storage node Nj may determine, based on the time stamp TPN at which the first client obtains the stripe SN and that is in the data of the strip SUNj, a sequence in which the first client writes strips.


With reference to any one of the second aspect of this application or the first to the third possible implementations of the second aspect, in a fourth possible implementation of the second aspect, the method further includes receiving, by the storage node Nj, data of a strip SUYj in a stripe SY sent by the first client, where the data of the strip SUYj is obtained by dividing second data by the first client, the second data is obtained by receiving a second write request by the first client, the second write request includes second data and the logical address, the logical address is used to determine that the second data is located in the partition P, that is, the logical address of the first data is the same as the logical address of the second data, and storing, by the storage node Nj based on a mapping between an identifier of a strip SUYj and a second physical address of the storage node Nj, the data of SUYj at the second physical address. Because the logical address is an address at which the data written by the client is stored in the distributed block storage system, that the logical address is located in the partition P and that the second data is located in the partition P have a same meaning. Data of a data strip SUYj further includes metadata such as an identifier of the data strip SUYj, and a logical address of the data of the data strip SUYj.


With reference to the fourth possible implementation of the second aspect of this application, in a fifth possible implementation of the second aspect, the method further includes assigning, by the storage node Nj, a time stamp TPYj to the data of the strip SUYj. The time stamp TPYj may be used as a reference time stamp based on which the data of the strip in the stripe SY is recovered after another storage node becomes faulty.


With reference to the fourth or the fifth possible implementation of the second aspect of this application, in a sixth possible implementation of the second aspect, the method further includes establishing, by the storage node Nj, a correspondence between a logical address of the data of the strip SUYj and an identifier of the strip SUYj such that the client accesses, using the logical address, the data of the strip SUYj stored in the storage node Nj in the distributed block storage system.


With reference to any one of the fourth to the sixth possible implementations of the second aspect of this application, in a seventh possible implementation of the second aspect, the data of SUYj includes at least one of the identifier of the first client and a time stamp TPY at which the first client obtains the stripe SY. The storage node Nj may determine, based on the identifier of the first client in the data of the strip SUYj, that the strip is written by the first client, and the storage node Nj may determine, based on the time stamp TPY at which the first client obtains the stripe SY and that is in the data of the strip SUYj, a sequence in which the first client writes strips.


With reference to the second aspect of this application or the first or the second possible implementation of the second aspect, in an eighth possible implementation of the second aspect, the method further includes receiving, by the storage node Nj, data of a strip SUKj in a stripe SK sent by a second client, where the data of the strip SUKj is obtained by dividing third data by the second client, the third data is obtained by receiving a third write request by the second client, the third write request includes the third data and the logical address, the logical address is used to determine that the third data is located in the partition P, that is, the logical address of the first data is the same as the logical address of the third data, and storing, by the storage node Nj based on a mapping between an identifier of a strip SUKj and a third physical address of the storage node Nj, the data of SUKj at the third physical address. Because the logical address is an address at which the data written by the client is stored in the distributed block storage system, that the logical address is located in the partition P and that the third data is located in the partition P have a same meaning. In the distributed block storage system, the first client and the second client may access the same logical address. Data of a data strip SUKj further includes metadata such as an identifier of the data strip SUKj, and a logical address of the data of the data strip SUKj.


With reference to the eighth possible implementation of the second aspect, in a ninth possible implementation of the second aspect, the method further includes assigning, by the storage node Nj, a time stamp TPKj to the data of the strip SUKj. The time stamp TPKj may be used as a reference time stamp based on which the data of the strip in the stripe SK is recovered after another storage node becomes faulty.


With reference to the eighth or the ninth possible implementation of the second aspect of this application, in a tenth possible implementation of the second aspect, the method further includes establishing, by the storage node Nj, a correspondence between a logical address of the data of the strip SUKj and an identifier of the strip SUKj such that the client accesses, using the logical address, the data of the strip SUKj stored in the storage node Nj in the distributed block storage system.


With reference to any one of the eighth to the tenth possible implementations of the second aspect of this application, in an eleventh possible implementation of the second aspect, the data of SUKj includes at least one of an identifier of the second client and a time stamp TPK at which the second client obtains the stripe SK. The storage node Nj may determine, based on the identifier of the second client in the data of the strip SUKj, that the strip is written by the second client, and the storage node Nj may determine, based on the time stamp TPK at which the second client obtains the stripe SK and that is in the data of the strip SUKj, a sequence in which the second client writes strips.


With reference to the second aspect of this application, in a twelfth possible implementation of the second aspect, the strip SUij in the stripe Si is assigned by a stripe metadata server from the storage node Nj based on a mapping between the partition P and the storage nodes Nj included in the partition P. Because the stripe metadata server assigns a physical storage address to the strip SUij in the stripe Si from the storage node Nj in advance, the waiting time of a client before it writes data may be reduced, thereby improving the write performance of the distributed block storage system.


With reference to any one of the second aspect of this application or the first to the twelfth possible implementations of the second aspect, in a thirteenth possible implementation of the second aspect, each piece of the data of the one or more strips SUNj further includes data strip status information, and the data strip status information is used to identify whether each data strip of the stripe SN is empty, such that all-0 data does not need to be written to the storage node in place of the data of a strip whose data is empty, thereby reducing the data write amount of the distributed block storage system.


With reference to the ninth possible implementation of the second aspect, in a fourteenth possible implementation of the second aspect, after the storage node Nj becomes faulty, a new storage node recovers the data of the strip SUNj and the data of SUKj based on the stripe SN and the stripe SK, respectively. The new storage node obtains a time stamp TPNX of data of a strip SUNX in a storage node NX as a reference time stamp of the data of the strip SUNj, and obtains a time stamp TPKX of data of a strip SUKX in the storage node NX as a reference time stamp of the data of the strip SUKj. The new storage node then eliminates, from a buffer based on the time stamp TPNX and the time stamp TPKX, the strip data corresponding to the earlier time among the data of the strip SUNj and the data of SUKj, where X is any integer from 1 to M other than j. The latest strip data is retained in the storage system, thereby saving buffer space.


With reference to the seventh possible implementation of the second aspect, in a fifteenth possible implementation of the second aspect, after the storage node Nj becomes faulty, a new storage node recovers the data of the strip SUNj and the data of SUYj based on the stripe SN and the stripe SY, respectively, where the data of the strip SUNj includes the time stamp TPN, and the data of the strip SUYj includes the time stamp TPY. The new storage node eliminates, from a buffer based on the time stamp TPN and the time stamp TPY, the earlier one of the data of the strip SUNj and the data of SUYj. The latest strip data of the same client is retained in the storage system, thereby saving buffer space.


With reference to the distributed block storage system in any one of the first aspect of this application or the first to the seventh possible implementations of the first aspect, a third aspect of this application further provides an apparatus for writing data in a distributed block storage system. The apparatus for writing data includes a plurality of units configured to perform any one of the first aspect of this application or the first to the seventh possible implementations of the first aspect.


With reference to the distributed block storage system in any one of the second aspect of this application or the first to the fifteenth possible implementations of the second aspect, a fourth aspect of this application further provides an apparatus for storing data in a distributed block storage system. The apparatus for storing data includes a plurality of units configured to perform any one of the second aspect of this application or the first to the fifteenth possible implementations of the second aspect.


A fifth aspect of this application further provides the distributed block storage system in any one of the second aspect of this application or the first to the fifteenth possible implementations of the second aspect. A storage node Nj in the distributed block storage system is configured to perform any one of the second aspect of this application or the first to the fifteenth possible implementations of the second aspect.


A sixth aspect of this application further provides a client, applied to the distributed block storage system in any one of the first aspect of this application or the first to the seventh possible implementations of the first aspect. The client includes a processor and an interface, the processor communicates with the interface, and the processor is configured to perform any one of the first aspect of this application or the first to the seventh possible implementations of the first aspect.


A seventh aspect of this application further provides a storage node, applied to the distributed block storage system in any one of the second aspect of this application or the first to the fifteenth possible implementations of the second aspect. The storage node used as a storage node Nj includes a processor and an interface, the processor communicates with the interface, and the processor is configured to perform any one of the second aspect of this application or the first to the fifteenth possible implementations of the second aspect.


An eighth aspect of this application further provides a computer readable storage medium, applied to the distributed block storage system in any one of the first aspect of this application or the first to the seventh possible implementations of the first aspect. The computer readable storage medium includes a computer instruction, used to enable a client to perform any one of the first aspect of this application or the first to the seventh possible implementations of the first aspect.


A ninth aspect of this application further provides a computer readable storage medium, applied to the distributed block storage system in any one of the second aspect of this application or the first to the fifteenth possible implementations of the second aspect. The computer readable storage medium includes a computer instruction, used to enable a storage node to perform any one of the second aspect of this application or the first to the fifteenth possible implementations of the second aspect.


A tenth aspect of this application further provides a computer program product, applied to the distributed block storage system in any one of the first aspect of this application or the first to the seventh possible implementations of the first aspect. The computer program product includes a computer instruction, used to enable a client to perform any one of the first aspect of this application or the first to the seventh possible implementations of the first aspect.


An eleventh aspect of this application further provides a computer program product, applied to the distributed block storage system in any one of the second aspect of this application or the first to the fifteenth possible implementations of the second aspect. The computer program product includes a computer instruction, used to enable a storage node to perform any one of the second aspect of this application or the first to the fifteenth possible implementations of the second aspect.





BRIEF DESCRIPTION OF DRAWINGS


FIG. 1 is a schematic diagram of data storage of a distributed block storage system;



FIG. 2 is a schematic diagram of a distributed block storage system according to an embodiment of the present disclosure;



FIG. 3 is a schematic structural diagram of a server in a distributed block storage system according to an embodiment of the present disclosure;



FIG. 4 is a schematic diagram of a partition view of a distributed block storage system according to an embodiment of the present disclosure;



FIG. 5 is a schematic diagram of a relationship between strips and storage nodes in a distributed block storage system according to an embodiment of the present disclosure;



FIG. 6 is a flowchart of a method for writing data by a client in a distributed block storage system according to an embodiment of the present disclosure;



FIG. 7 is a schematic diagram of determining a partition by a client in a distributed block storage system according to an embodiment of the present disclosure;



FIG. 8 is a flowchart of a method for storing data in a storage node in a distributed block storage system according to an embodiment of the present disclosure;



FIG. 9 is a schematic diagram of storing a stripe in a storage node in a distributed block storage system according to an embodiment of the present disclosure;



FIG. 10 is a schematic diagram of storing a stripe in a storage node in a distributed block storage system according to an embodiment of the present disclosure;



FIG. 11 is a schematic structural diagram of an apparatus for writing data in a distributed block storage system according to an embodiment of the present disclosure; and



FIG. 12 is a schematic structural diagram of an apparatus for storing data in a distributed block storage system according to an embodiment of the present disclosure.





DESCRIPTION OF EMBODIMENTS

A distributed block storage system in the embodiments of the present disclosure is, for example, the Huawei® FusionStorage® series. For example, as shown in FIG. 2, a distributed block storage system includes a plurality of servers such as a server 1, a server 2, a server 3, a server 4, a server 5, and a server 6, and the servers communicate with each other using InfiniBand, Ethernet, or the like. In actual application, the quantity of servers in the distributed block storage system may be increased based on an actual requirement. This is not limited in the embodiments of the present disclosure.


A server of the distributed block storage system includes the structure shown in FIG. 3. As shown in FIG. 3, each server in the distributed block storage system includes a central processing unit (CPU) 301, a memory 302, an interface 303, a hard disk 1, a hard disk 2, and a hard disk 3. The memory 302 stores a computer instruction, and the CPU 301 executes the computer instruction in the memory 302 to perform a corresponding operation. The interface 303 may be a hardware interface such as a network interface card (NIC) or a host bus adapter (HBA), or may be a program interface module or the like. A hard disk may be a solid-state drive (SSD), a mechanical hard disk, or a hybrid hard disk, where the mechanical hard disk is, for example, a hard disk drive (HDD). Additionally, to save computing resources of the CPU 301, a field-programmable gate array (FPGA) or other hardware may replace the CPU 301 to perform the foregoing corresponding operation, or an FPGA or other hardware may perform the foregoing corresponding operation jointly with the CPU 301. For convenience of description, in the embodiments of the present disclosure, the CPU 301 and the memory 302, the FPGA or other hardware replacing the CPU 301, or the combination of the FPGA or other hardware and the CPU 301 are collectively referred to as a processor.


In the structure shown in FIG. 3, an application program is loaded into the memory 302, the CPU 301 executes an instruction of the application program in the memory 302, and the server is used as a client. Additionally, the client may be a device independent of the servers shown in FIG. 2. The application program may be a virtual machine (VM), or may be a particular application such as office software. The client writes data to the distributed block storage system or reads data from the distributed block storage system. For a structure of the client, refer to FIG. 3 and the related description.

A program of the distributed block storage system is loaded into the memory 302, and the CPU 301 executes the program of the distributed block storage system in the memory 302 to provide a block protocol access interface to the client and provide a distributed block storage access point service to the client, such that the client accesses a storage resource in a storage resource pool in the distributed block storage system. The block protocol access interface is configured to provide a logical unit to the client. The server runs the program of the distributed block storage system such that the server that includes the hard disks and that is used as a storage node is configured to store data of the client. For example, in the server, one hard disk may be used as one storage node by default. That is, when the server includes a plurality of hard disks, the plurality of hard disks may be used as a plurality of storage nodes. In another implementation, the server runs the program of the distributed block storage system to serve as one storage node. This is not limited in the embodiments of the present disclosure. Therefore, for a structure of a storage node, refer to FIG. 3 and the related description.

When the distributed block storage system is initialized, hash space (such as 0 to 2^32) is divided into N equal portions, each equal portion is one partition, and the N equal portions are evenly distributed based on the quantity of hard disks. For example, in the distributed block storage system, N is 3600 by default, that is, the partitions are P1, P2, P3, . . . , and P3600, respectively. If the current distributed block storage system includes 18 hard disks (storage nodes), each storage node bears 200 partitions. A partition P includes M storage nodes Nj, and a correspondence between a partition and a storage node, that is, a mapping between a partition and the storage nodes Nj included in the partition, is also referred to as a partition view. As shown in FIG. 4, using an example in which a partition includes four storage nodes Nj, the partition view is "P2—storage node N1—storage node N2—storage node N3—storage node N4", where j is every integer from 1 to M. The storage nodes are assigned when the distributed block storage system is initialized, and are adjusted subsequently as the quantity of hard disks in the distributed block storage system changes. The client stores the partition view.
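
For illustration only, the following is a minimal sketch of such a partition view, assuming 3600 partitions, 18 storage nodes, and M = 4 nodes per partition; the round-robin placement and the node names are assumptions of this sketch, as the embodiments do not prescribe a specific assignment algorithm.

```python
# Minimal sketch of a partition view; the placement policy is illustrative.
N_PARTITIONS = 3600
N_STORAGE_NODES = 18
M = 4  # storage nodes per partition (e.g., for a 3+1 stripe)

def build_partition_view(n_partitions: int, n_nodes: int, m: int) -> dict:
    """Map each partition to M distinct storage nodes, round-robin style."""
    view = {}
    for p in range(n_partitions):
        view[f"P{p + 1}"] = [f"N{(p + k) % n_nodes + 1}" for k in range(m)]
    return view

partition_view = build_partition_view(N_PARTITIONS, N_STORAGE_NODES, M)
print(partition_view["P2"])  # e.g., ['N2', 'N3', 'N4', 'N5']
```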


Based on a reliability requirement of the distributed block storage system, data reliability can be improved using an EC algorithm, for example, a 3+1 mode, in which a stripe includes three data strips and one check strip. In the embodiments of the present disclosure, a partition stores data in a stripe form, and one partition includes R stripes Si, where i is every integer from 1 to R. In the embodiments of the present disclosure, P2 is used as an example for description.
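
For illustration, the following sketch shows a 3+1 stripe using single-parity XOR, the simplest EC instance; the embodiments do not fix a particular EC code, so the XOR choice is an assumption of this sketch.

```python
# Minimal 3+1 EC sketch: the check strip is the bytewise XOR of the data strips.
def xor_parity(strips: list) -> bytes:
    parity = bytearray(len(strips[0]))
    for strip in strips:
        for i, b in enumerate(strip):
            parity[i] ^= b
    return bytes(parity)

data_strips = [b"AAAA", b"BBBB", b"CCCC"]   # three data strips
check_strip = xor_parity(data_strips)       # one check strip
# Any single lost strip can be rebuilt from the remaining three:
rebuilt = xor_parity([data_strips[1], data_strips[2], check_strip])
assert rebuilt == data_strips[0]
```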


The distributed block storage system performs fragment management on a hard disk using 4 kilobytes (KB) as a unit, records assignment information of each 4-KB fragment in a metadata management area of the hard disk, and the storage resource pool includes the fragments of the hard disks. The distributed block storage system includes a stripe metadata server, and in a specific implementation, a stripe metadata management program may run on one or more servers in the distributed block storage system. The stripe metadata server assigns a stripe to a partition. Still using the partition view shown in FIG. 4 as an example, the stripe metadata server assigns, to a stripe Si of the partition P2 based on the partition view and as shown in FIG. 5, a physical storage address, that is, storage space, for each strip SUij in the stripe from the storage node Nj corresponding to the partition: a physical storage address is assigned to SUi1 from a storage node N1, to SUi2 from a storage node N2, to SUi3 from a storage node N3, and to SUi4 from a storage node N4. The storage node Nj records a mapping between an identifier of a strip SUij and its physical storage address. The physical address that the stripe metadata server assigns to a strip in a stripe from a storage node may be assigned in advance when the distributed block storage system is initialized, or assigned in advance before the client sends data to the storage node. In the embodiments of the present disclosure, the strip SUij in the stripe Si is only a segment of storage space before the client writes data. When receiving data, the client performs division based on a size of the strip SUij in the stripe Si to obtain data of the strip SUij, that is, the strip SUij included in the stripe Si is used to store the data of the strip SUij obtained by dividing the data by the client. To reduce the quantity of strip identifiers managed by the stripe metadata server, the stripe metadata server assigns a version number to the identifier of a strip in a stripe. After a stripe is released, the version number of the strip identifier of a strip in the released stripe is updated so that the identifier can serve as a strip identifier of a strip in a new stripe. Because the stripe metadata server assigns a physical storage address to the strip SUij in the stripe Si from the storage node Nj in advance, the waiting time of a client before it writes data may be reduced, thereby improving the write performance of the distributed block storage system.
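
For illustration, a minimal sketch of the advance assignment follows; the bump allocator, the 4-KB strip size, the versioned identifier format, and the node names are assumptions of this sketch.

```python
# Minimal sketch of advance strip assignment; allocator and names are illustrative.
from itertools import count

class StorageNode:
    def __init__(self, name: str):
        self.name = name
        self._slots = count(0)     # naive bump allocator over 4-KB fragments
        self.strip_to_addr = {}    # strip identifier -> physical storage address

    def assign(self, strip_id: str, strip_size: int = 4096) -> int:
        addr = next(self._slots) * strip_size
        self.strip_to_addr[strip_id] = addr   # the node records this mapping
        return addr

def assign_stripe(stripe_i: int, nodes: list, version: int = 0) -> dict:
    """Pre-assign a physical address for each strip SUij of stripe Si."""
    assignment = {}
    for j, node in enumerate(nodes, start=1):
        strip_id = f"SU{stripe_i}{j}.v{version}"   # versioned strip identifier
        assignment[strip_id] = (node.name, node.assign(strip_id))
    return assignment

nodes = [StorageNode(f"N{j}") for j in range(1, 5)]
print(assign_stripe(1, nodes))   # SU11..SU14, each mapped to a node and address
```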


In the embodiments of the present disclosure, a logical unit assigned by the distributed block storage system is mounted to the client, thereby performing a data access operation. The logical unit is also referred to as a logical unit number (LUN). In the distributed block storage system, one logical unit may be mounted to only one client, or one logical unit may be mounted to a plurality of clients, that is, a plurality of clients share one logical unit. The logical unit is provided by the storage resource pool shown in FIG. 2.


In an embodiment of the present disclosure, as shown in FIG. 6, a first client performs the following steps.


Step 601: The first client receives a first write request, where the first write request includes first data and a logical address.


In a distributed block storage system, the first client may be a VM or a server. An application program is run on the first client, and the application program accesses a logical unit mounted to the first client, for example, sends the first write request to the logical unit. The first write request includes the first data and the logical address, and the logical address is also referred to as a logical block address (LBA). The logical address is used to indicate a write location of the first data in the logical unit.


Step 602: The first client determines that the logical address is located in a partition P.


In this embodiment of the present disclosure, a partition P2 is used as an example. With reference to FIG. 4, the first client stores a partition view of the distributed block storage system. As shown in FIG. 7, the first client determines, based on the partition view, a partition in which the logical address included in the first write request is located. In an implementation, the first client generates a key based on the logical address, calculates a Hash value of the key based on a Hash algorithm, and determines a partition corresponding to the Hash value, thereby determining that the logical address is located in the partition P2. This also means that the first data is located in the partition P2.
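
For illustration, a minimal sketch of this determination follows; deriving the key from the logical address bytes and hashing with SHA-1 are assumptions of this sketch, as the embodiments do not fix a specific key derivation or hash algorithm.

```python
# Minimal sketch of step 602: hash a key derived from the logical address into
# one of N partitions; the key derivation and hash function are illustrative.
import hashlib

N_PARTITIONS = 3600

def partition_of(logical_address: int) -> str:
    key = logical_address.to_bytes(8, "big")
    h = int.from_bytes(hashlib.sha1(key).digest(), "big")
    return f"P{h % N_PARTITIONS + 1}"

print(partition_of(0x1000))   # deterministically one of P1 .. P3600
```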


Step 603: The first client obtains a stripe SN from R stripes, where N is an integer from 1 to R.


A stripe metadata server manages a correspondence between a partition and a stripe, and a relationship between a strip in a stripe and a storage node. In an implementation in which the first client obtains a stripe SN from R stripes, the first client determines that the logical address is located in the partition P2, and the first client queries the stripe metadata server to obtain a stripe SN of the R stripes included in the partition P2. Because the logical address is an address at which the data written by the client is stored in the distributed block storage system, that the logical address is located in the partition P and that the first data is located in the partition P have a same meaning. In another implementation in which the first client obtains a stripe SN from R stripes, the first client may obtain a stripe SN from stripes that are assigned to the first client and that are of the R stripes.


Step 604: The first client divides the first data into data of one or more strips SUNj in the stripe SN.


The stripe SN includes strips, and the first client receives the first write request, buffers the first data included in the first write request, and divides the buffered data based on the size of a strip in the stripe. For example, the first client performs division based on the length of the strip in the stripe to obtain strip-size data, and performs a modulo operation, using the logical address of the strip-size data, on the quantity M (for example, four) of storage nodes in the partition, thereby determining the location of the strip-size data in the stripe, that is, the corresponding strip SUNj. The first client then determines the storage node Nj corresponding to the strip SUNj based on the partition view, such that data of strips having the same logical address is located in the same storage node. In this manner, the first data is divided into data of one or more strips SUNj. In this embodiment of the present disclosure, P2 is used as an example. With reference to FIG. 5, the stripe SN includes four strips SUN1, SUN2, SUN3, and SUN4. An example in which the first data is divided into data of two strips is used, that is, data of SUN1 and data of SUN2. Data of the strip SUN3 may be obtained by dividing data in another write request sent by the first client. For details, refer to the description of the first write request. Then, data of the check strip SUN4 is generated based on the data of SUN1, the data of SUN2, and the data of SUN3, and the data of the check strip SUN4 is also referred to as check data. For how to generate the data of the check strip based on the data of the data strips in the stripe, refer to an existing stripe implementation algorithm. Details are not described again in this embodiment of the present disclosure.
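
For illustration, a minimal sketch of this division follows; the 4-KB strip size and the exact modulo formula are assumptions of this sketch.

```python
# Minimal sketch of step 604: split data into strip-size chunks; the modulo of
# each chunk's logical address selects its strip SUNj (and thus its node).
STRIP_SIZE = 4096
M = 4

def divide_into_strips(first_data: bytes, logical_address: int) -> dict:
    placement = {}
    for off in range(0, len(first_data), STRIP_SIZE):
        chunk = first_data[off:off + STRIP_SIZE]
        chunk_lba = logical_address + off
        j = (chunk_lba // STRIP_SIZE) % M + 1   # strip index within the stripe
        placement[f"SUN{j}"] = placement.get(f"SUN{j}", b"") + chunk
    return placement

data = bytes(2 * STRIP_SIZE)                    # first data: two strips' worth
print(sorted(divide_into_strips(data, 0)))      # ['SUN1', 'SUN2']
```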


In this embodiment of the present disclosure, the stripe SN includes four strips, that is, three data strips and one check strip. When the first client buffers data and needs to write the data to a storage node after a period of time, but the buffered data cannot fill all the data strips, for example, there are only the data of the strip SUN1 and the data of SUN2 obtained by dividing the first data, the check strip is generated based on the data of SUN1 and the data of SUN2. Data of a valid data strip SUNj includes data strip status information of the stripe SN, where a valid data strip SUNj is a strip whose data is not empty. In this embodiment of the present disclosure, both the data of the valid data strip SUN1 and the data of SUN2 include the data strip status information of the stripe SN, and the data strip status information is used to identify whether each data strip of the stripe SN is empty. For example, if 1 is used to indicate that a data strip is not empty, and 0 is used to indicate that a data strip is empty, the data strip status information included in the data of SUN1 is 110, and the data strip status information included in the data of SUN2 is 110, indicating that SUN1 is not empty, SUN2 is not empty, and SUN3 is empty. The data of the check strip SUN4 generated based on the data of SUN1 and the data of SUN2 includes check data of the data strip status information. Because SUN3 is empty, the first client does not need to replace the data of SUN3 with all-0 data and write the all-0 data to a storage node N3, thereby reducing the data write amount. When reading the stripe SN, the first client determines, based on the data strip status information of the stripe SN included in the data of the data strip SUN1 or the data of SUN2, that the data of SUN3 is empty.


When SUN3 is not empty, the data strip status information included in the data of SUN1, the data of SUN2, and the data of SUN3 in this embodiment of the present disclosure is 111, and the data of the check strip SUN4 generated based on the data of SUN1, the data of SUN2, and the data of SUN3 includes check data of the data strip status information.
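
For illustration, a minimal sketch of the data strip status information follows, assuming a bitmap of one character per data strip ('1' = not empty); the field layout is an assumption of this sketch.

```python
# Minimal sketch of the status bitmap: '1' marks a non-empty data strip, and
# only non-empty strips (each carrying the bitmap) are written to nodes.
def strip_status(strips: dict, n_data_strips: int = 3) -> str:
    return "".join("1" if strips.get(f"SUN{j}") else "0"
                   for j in range(1, n_data_strips + 1))

strips = {"SUN1": b"...", "SUN2": b"..."}        # SUN3 is empty
status = strip_status(strips)                    # '110'
to_write = {s: (payload, status) for s, payload in strips.items() if payload}
print(status, sorted(to_write))                  # no all-0 strip is written
```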


Further, in this embodiment of the present disclosure, the data of the data strip SUNj further includes at least one of an identifier of the first client and a time stamp TPN at which the first client obtains the stripe SN, that is, includes any one of or a combination of the identifier of the first client and the time stamp TPN at which the first client obtains the stripe SN. When data of a check strip SUNj is generated based on the data of the data strip SUNj, the data of the check strip SUNj also includes check data of at least one of the identifier of the first client and the time stamp TPN at which the first client obtains the stripe SN.


In this embodiment of the present disclosure, the data of the data strip SUNj further includes metadata such as an identifier of the data strip SUNj, and a logical address of the data of the data strip SUNj.


Step 605: The first client sends the data of the one or more strips SUNj to a storage node Nj.


In this embodiment of the present disclosure, the first client sends the data of SUN1 obtained by dividing the first data to the storage node N1, and sends the data of SUN2 obtained by dividing the first data to the storage node N2. The first client may concurrently send the data of the strip SUNj of the stripe SN to the storage node Nj without needing a primary storage node in order to reduce data exchange between the storage nodes, and improve write concurrency, thereby improving write performance of the distributed block storage system.
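
For illustration, a minimal sketch of the concurrent sending in step 605 follows; the thread pool and the send_to_node placeholder stand in for the real transport and are assumptions of this sketch.

```python
# Minimal sketch of step 605: each strip is sent to its storage node in
# parallel; send_to_node is a placeholder for the real network write.
from concurrent.futures import ThreadPoolExecutor

def send_to_node(node: str, strip_id: str, data: bytes) -> str:
    return f"{node} stored {strip_id} ({len(data)} bytes)"

def write_stripe_concurrently(placement: dict) -> list:
    """placement maps a strip id 'SUNj' to (storage node 'Nj', strip data)."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(send_to_node, node, sid, data)
                   for sid, (node, data) in placement.items()]
        return [f.result() for f in futures]

placement = {"SUN1": ("N1", b"aaaa"), "SUN2": ("N2", b"bbbb"),
             "SUN4": ("N4", b"pppp")}   # check strip to N4; SUN3 is empty
print(write_stripe_concurrently(placement))
```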


Further, if a logical unit is mounted to only the first client, the first client receives a second write request, where the second write request includes second data and the logical address that is described in FIG. 6, the first client determines, based on the algorithm described in the process in FIG. 6, that the logical address is located in the partition P2, the first client obtains a stripe SY from the R stripes, the first client divides the second data into data of one or more strips SUYj in the stripe SY, such as data of SUY1 and data of SUY2, and the first client sends the data of the one or more strips SUYj to the storage node Nj, that is, sends the data of SUY1 to the storage node N1, and sends the data of SUY2 to the storage node N2, where Y is an integer from 1 to R, and N is different from Y. In this embodiment of the present disclosure, that the logical address is located in the partition P and that the second data is located in the partition P have a same meaning. Further, data of a valid data strip SUYj includes data strip status information of the stripe SY. Further, the data of the data strip SUYj further includes at least one of an identifier of the first client and a time stamp TPY at which the first client obtains the stripe SY. Further, the data of the data strip SUYj further includes metadata of the data of the data strip SUYj, such as an identifier of the strip SUYj, and a logical address of the data of the strip SUYj. For a further description, refer to the description of the first client in FIG. 6. Details are not described herein again. For obtaining, by the first client, the stripe SY from the R stripes, refer to obtaining, by the first client, the stripe SN from the R stripes. Details are not described herein again.


Further, if a logical unit is mounted to a plurality of clients, for example, mounted to the first client and a second client, the second client receives a third write request, where the third write request includes third data and the logical address that is described in FIG. 6. The second client determines, based on the algorithm described in the process in FIG. 6, that the logical address is located in the partition P2, the second client obtains a stripe SK from the R stripes, the second client divides the third data into data of one or more strips SUKj in the stripe SK, such as data of SUK1 and data of SUK2, and the second client sends the data of the one or more strips SUKj to the storage node Nj, that is, sends the data of SUK1 to the storage node N1, and sends the data of SUK2 to the storage node N2, where K is an integer from 1 to R, and N is different from K. That the logical address is located in the partition P and that the third data is located in the partition P have a same meaning. For the meaning of obtaining, by the second client, the stripe SK from the R stripes, refer to the meaning of obtaining, by the first client, the stripe SN from the R stripes. Details are not described herein again. Further, data of a valid data strip SUKj includes data strip status information of the stripe SK. Further, the data of the data strip SUKj further includes at least one of an identifier of the second client and a time stamp TPK at which the second client obtains the stripe SK. Further, the data of the data strip SUKj further includes metadata such as an identifier of the data strip SUKj, and a logical address of the data of the data strip SUKj. For a further description of the second client, refer to the description of the first client in FIG. 6. Details are not described herein again.


In other approaches, a client needs to first send data to a primary storage node, and the primary storage node divides the data into data of strips and sends the data of the strips other than the strip stored in the primary storage node to the corresponding storage nodes. As a result, the primary storage node becomes a data storage bottleneck in a distributed block storage system, and data exchange between the storage nodes is increased. In the embodiment shown in FIG. 6, however, the client divides the data into the data of the strips and sends the data of the strips to the corresponding storage nodes without needing a primary storage node, thereby removing the primary storage node bottleneck and reducing data exchange between the storage nodes; the data of the strips of the stripe is concurrently written to the corresponding storage nodes, which also improves the write performance of the distributed block storage system.


Corresponding to the embodiment of the first client shown in FIG. 6, as shown in FIG. 8, a storage node Nj performs the following steps.


Step 801: The storage node Nj receives data of a strip SUNj in a stripe SN sent by a first client.


With reference to the embodiment shown in FIG. 6, a storage node N1 receives data of SUN1 sent by the first client, and a storage node N2 receives data of SUN2 sent by the first client.


Step 802: The storage node Nj stores, based on a mapping between an identifier of the strip SUNj and a first physical address of the storage node Nj, the data of SUNj at the first physical address.


A stripe metadata server assigns, in the storage node Nj, the first physical address to the strip SUNj of the stripe SN in a partition in advance based on a partition view, metadata of the storage node Nj stores the mapping between the identifier of the strip SUNj and the first physical address of the storage node Nj, and the storage node Nj receives the data of the strip SUNj, and stores the data of the strip SUNj at the first physical address based on the mapping. For example, the storage node N1 receives the data of SUN1 sent by the first client, and stores the data of SUN1 at the first physical address of N1, and the storage node N2 receives the data of SUN2 sent by the first client, and stores the data of SUN2 at the first physical address of N2.
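
For illustration, a minimal sketch of step 802 follows; the byte-array "disk", the fixed strip size, and the pre-populated mapping are assumptions of this sketch.

```python
# Minimal sketch of step 802: the node looks up the pre-assigned physical
# address for the strip identifier and writes the strip data there.
STRIP_SIZE = 4096

class StorageNodeStore:
    def __init__(self, capacity_strips: int = 16):
        self.disk = bytearray(capacity_strips * STRIP_SIZE)
        self.strip_to_addr = {"SUN1": 0}   # mapping assigned in advance

    def store_strip(self, strip_id: str, data: bytes) -> int:
        addr = self.strip_to_addr[strip_id]   # mapping kept in node metadata
        self.disk[addr:addr + len(data)] = data
        return addr

n1 = StorageNodeStore()
print(n1.store_strip("SUN1", b"first data chunk"))   # stored at address 0
```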


In the other approaches, a primary storage node receives data sent by a client, divides the data into data of data strips in a stripe, generates data of a check strip based on the data of the data strips, and sends the data of the strips stored in other storage nodes to the corresponding storage nodes. However, in this embodiment of the present disclosure, the storage node Nj receives only the data of the strip SUNj sent by the client, without needing a primary storage node, in order to reduce data exchange between storage nodes, and the data of the strips is concurrently written to the corresponding storage nodes, in order to improve the write performance of the distributed block storage system.


Further, the data of the strip SUNj is obtained by dividing first data, a first write request includes a logical address of the first data, and the data of the strip SUNj used as a part of the first data also has a corresponding logical address. Therefore, the storage node Nj establishes a mapping between the logical address of the data of the strip SUNj and the identifier of the strip SUNj. In this way, the first client still accesses the data of the strip SUNj using the logical address. For example, when the first client accesses the data of the strip SUNj, the first client performs a modulo operation on a quantity M (such as four) of storage nodes in a partition P using the logical address of the data of the strip SUNj, determines that the strip SUNj is located in the storage node Nj, and sends a read request carrying the logical address of the data of the strip SUNj to the storage node Nj, the storage node Nj obtains the identifier of the strip SUNj based on the mapping between the logical address of the data of the strip SUNj and the identifier of the strip SUNj, and the storage node Nj obtains the data of the strip SUNj based on the mapping between the identifier of the strip SUNj and the first physical address of the storage node Nj.
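
For illustration, a minimal sketch of this read path follows; the in-memory dictionaries stand in for the node's metadata and disk, and are assumptions of this sketch.

```python
# Minimal read-path sketch: the client locates the node with the same modulo
# as the write path; the node chains LBA -> strip id -> physical address.
STRIP_SIZE = 4096
M = 4

class NodeReadPath:
    def __init__(self):
        self.lba_to_strip = {0: "SUN1"}        # mapping established on write
        self.strip_to_addr = {"SUN1": 0}       # mapping assigned in advance
        self.disk = {0: b"first data chunk"}   # physical address -> bytes

    def read(self, lba: int) -> bytes:
        strip_id = self.lba_to_strip[lba]
        return self.disk[self.strip_to_addr[strip_id]]

def node_for(lba: int) -> int:
    return (lba // STRIP_SIZE) % M + 1

print(f"read from N{node_for(0)}:", NodeReadPath().read(0))
```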


With reference to the embodiment shown in FIG. 6 and the related description, further, the storage node Nj receives data of a strip SUYj in a stripe SY sent by the first client. For example, the storage node N1 receives data of SUY1 sent by the first client, and the storage node N2 receives data of SUY2 sent by the first client. The storage node Nj stores, based on a mapping between an identifier of the strip SUYj and a second physical address of the storage node Nj, the data of SUYj at the second physical address, for example, stores the data of SUY1 at a second physical address of N1, and stores the data of SUY2 at a second physical address of N2. The data of the strip SUYj, used as a part of second data, also has a corresponding logical address, and therefore, the storage node Nj establishes a mapping between the logical address of the data of the strip SUYj and the identifier of the strip SUYj. In this way, the first client still accesses the data of the strip SUYj using the logical address. The data of the strip SUYj and the data of the strip SUNj have the same logical address.


With reference to the embodiment shown in FIG. 6 and the related description, when a logical unit is mounted to the first client and the second client, further, the storage node Nj receives data of a strip SUKj in a stripe SK sent by the second client. For example, the storage node N1 receives data of SUK1 sent by the second client, and the storage node N2 receives data of SUK2 sent by the second client. The storage node Nj stores, based on a mapping between an identifier of the strip SUKj and a third physical address of the storage node Nj, the data of SUKj at the third physical address, for example, stores the data of SUK1 at a third physical address of N1, and stores the data of SUK2 at a third physical address of N2. The data of the strip SUKj used as a part of third data also has a corresponding logical address, and therefore, the storage node Nj establishes a mapping between the logical address of the data of the strip SUKj and the identifier of the strip SUKj. In this way, the second client still accesses the data of the strip SUKj using the logical address. The data of the strip SUKj and the data of the strip SUNj have the same logical address.


Further, the storage node Nj assigns a time stamp TPNj to the data of the strip SUNj, a time stamp TPKj to the data of the strip SUKj, and a time stamp TPYj to the data of the strip SUYj. Based on these time stamps, the storage node Nj may eliminate, from a buffer, the strip data corresponding to an earlier time among strip data that has the same logical address, and retain the latest strip data, thereby saving buffer space.


With reference to the embodiment shown in FIG. 6 and the related description, when a logical unit is mounted to only the first client, the data of the strip SUNj sent by the first client to the storage node Nj includes the time stamp TPN at which the first client obtains the stripe SN, and the data of the strip SUYj sent by the first client to the storage node Nj includes the time stamp TPY at which the first client obtains the stripe SY. As shown in FIG. 9, none of the data strips of the stripe SN is empty, each of the data of SUN1, the data of SUN2, and the data of SUN3 includes the time stamp TPN at which the first client obtains the stripe SN, and the data of the check strip SUN4 of the stripe SN includes check data TPNp of the time stamp TPN; likewise, none of the data strips of the stripe SY is empty, each of the data of SUY1, the data of SUY2, and the data of SUY3 includes the time stamp TPY at which the first client obtains the stripe SY, and the data of the check strip SUY4 of the stripe SY includes check data TPYp of the time stamp TPY. Therefore, after a storage node storing a data strip becomes faulty, the distributed block storage system recovers, in a new storage node based on the stripes and the partition view, the data of the strip SUNj of the stripe SN in the faulty storage node Nj, and recovers the data of the strip SUYj of the stripe SY in the faulty storage node Nj. The buffer of the new storage node therefore includes the data of the strip SUNj and the data of SUYj. The data of SUNj includes the time stamp TPN, and the data of SUYj includes the time stamp TPY. Because both the time stamp TPN and the time stamp TPY are assigned by the first client or by a same time stamp server, the time stamps may be compared. The new storage node eliminates, from the buffer based on the time stamp TPN and the time stamp TPY, the strip data corresponding to the earlier time. The new storage node may be a storage node obtained by recovering the faulty storage node Nj, or a storage node of a partition in which a newly added stripe is located in the distributed block storage system.

In this embodiment of the present disclosure, an example in which the storage node N1 is faulty is used, and the buffer of the new storage node includes the data of the strip SUN1 and the data of SUY1. The data of SUN1 includes the time stamp TPN, the data of SUY1 includes the time stamp TPY, and the time stamp TPN is earlier than the time stamp TPY. Therefore, the new storage node eliminates the data of the strip SUN1 from the buffer, and the latest strip data is retained in the storage system, thereby saving buffer space. More generally, the storage node Nj may eliminate, based on time stamps assigned by a same client, the strip data corresponding to an earlier time among strip data that is from the same client and that has the same logical address in the buffer, and retain the latest strip data, thereby saving buffer space.
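
For illustration, a minimal sketch of this eviction rule follows; the tuple layout of the buffered strip data and the integer time stamps are assumptions of this sketch.

```python
# Minimal eviction sketch: among buffered strips that share a logical address,
# keep only the strip whose client-assigned stripe time stamp is latest.
def evict_older(buffer: list) -> list:
    """Entries: (strip_id, logical_address, client_time_stamp, data)."""
    latest = {}
    for strip_id, lba, ts, data in buffer:
        if lba not in latest or ts > latest[lba][2]:
            latest[lba] = (strip_id, lba, ts, data)
    return list(latest.values())

buffer = [("SUN1", 0, 100, b"old"),   # TPN: earlier write of this address
          ("SUY1", 0, 200, b"new")]   # TPY: later write, same address
print(evict_older(buffer))            # SUY1 survives; SUN1 is eliminated
```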


With reference to the embodiment shown in FIG. 6 and the related description, when a logical unit is mounted to both the first client and the second client, the storage node Nj assigns a time stamp TPNj to the data of the strip SUNj sent by the first client, and assigns a time stamp TPKj to the data of the strip SUKj sent by the second client. As shown in FIG. 10, none of the data strips of the stripe SN is empty: a time stamp assigned by a storage node N1 to data of a strip SUN1 is TPN1, a time stamp assigned by a storage node N2 to data of a strip SUN2 is TPN2, a time stamp assigned by a storage node N3 to data of a strip SUN3 is TPN3, and a time stamp assigned by a storage node N4 to data of a strip SUN4 is TPN4. Likewise, none of the data strips of the stripe SK is empty: a time stamp assigned by the storage node N1 to data of a strip SUK1 is TPK1, a time stamp assigned by the storage node N2 to data of a strip SUK2 is TPK2, a time stamp assigned by the storage node N3 to data of a strip SUK3 is TPK3, and a time stamp assigned by the storage node N4 to data of a strip SUK4 is TPK4. Therefore, after a storage node storing a data strip becomes faulty, the distributed block storage system recovers, based on the stripes and the partition view, the data of the strip SUNj of the stripe SN and the data of the strip SUKj of the stripe SK that were stored in the faulty storage node Nj, and a buffer of a new storage node then includes the data of the strip SUNj and the data of SUKj. In an implementation, when the data of the strip SUNj includes the time stamp TPN assigned by the first client, the data of the strip SUKj includes the time stamp TPK assigned by the second client, and TPN and TPK are assigned by a same time stamp server, the time stamp TPN of the data of the strip SUNj may be directly compared with the time stamp TPK of the data of SUKj, and the new storage node eliminates, from the buffer based on the time stamp TPN and the time stamp TPK, the strip data corresponding to an earlier time. When the data of the strip SUNj does not include the time stamp TPN assigned by the first client and/or the data of the strip SUKj does not include the time stamp TPK assigned by the second client, or when the time stamps TPN and TPK are not from a same time stamp server, the time stamps cannot be compared directly. In this case, the new storage node may query a storage node NX for the time stamps of data of strips of the stripes SN and SK: the new storage node obtains a time stamp TPNX assigned by the storage node NX to data of a strip SUNX and uses TPNX as a reference time stamp of the data of SUNj, obtains a time stamp TPKX assigned by the storage node NX to data of a strip SUKX and uses TPKX as a reference time stamp of the data of SUKj, and eliminates, from the buffer based on the time stamp TPNX and the time stamp TPKX, the strip data corresponding to an earlier time in the data of the strip SUNj and the data of SUKj, where X is any integer from 1 to M other than j. In this embodiment of the present disclosure, an example in which the storage node N1 is faulty is used, and the buffer of the new storage node includes the data of the strip SUN1 and the data of SUK1.
The new storage node obtains a time stamp TPN2 assigned by the storage node N2 to the data of SUN2 as a reference time stamp of the data of SUN1, and obtains a time stamp TPK2 assigned by the storage node N2 to the data of SUK2 as a reference time stamp of the data of SUK1. In this example, the time stamp TPN2 is earlier than the time stamp TPK2; therefore, the new storage node eliminates the data of the strip SUN1 from the buffer and reserves the latest strip data in the storage system, thereby saving buffer space. In this embodiment of the present disclosure, the storage node Nj likewise assigns a time stamp TPYj to the data of the strip SUYj.
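

A sketch of this reference-time-stamp fallback follows; peer_node.local_stamp is a hypothetical RPC standing in for the query to the surviving storage node NX, which this embodiment leaves unspecified.

```python
def pick_strip_to_evict(peer_node, strip_n_id, strip_k_id):
    """Query a surviving storage node NX for the time stamps it assigned
    locally to strips of the same stripes (e.g. TPN2 for SUN2 and TPK2 for
    SUK2) and use them as reference time stamps for SUNj and SUKj.
    Returns the identifier of the strip whose data corresponds to the
    earlier time and should be eliminated from the new node's buffer."""
    tpnx = peer_node.local_stamp(strip_n_id)  # reference time stamp of SUNj
    tpkx = peer_node.local_stamp(strip_k_id)  # reference time stamp of SUKj
    return strip_n_id if tpnx < tpkx else strip_k_id
```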


A time stamp assigned by the storage node Nj may be from a time stamp server, or may be generated by the storage node Nj.


Further, an identifier of the first client included in the data of the data strip SUNj, the time stamp at which the first client obtains the stripe SN, an identifier of the data strip SUNj, a logical address of the data of the data strip SUNj, and data strip status information may be stored at an extension address of the physical address assigned by the storage node Nj to the data strip SUNj, thereby avoiding occupying the valid physical address space of the storage node Nj. The extension address of a physical address is a physical address that lies beyond the valid physical address capacity of the storage node Nj and is therefore not externally visible; when receiving a read request for accessing the physical address, the storage node Nj reads the data at the extension address of the physical address by default. An identifier of the second client included in the data of the data strip SUKj, the time stamp at which the second client obtains the stripe SK, an identifier of the data strip SUKj, a logical address of the data of the data strip SUKj, and data strip status information may likewise be stored at an extension address of the physical address assigned by the storage node Nj to the data strip SUKj. Likewise, an identifier of the first client included in the data of the data strip SUYj, the time stamp at which the first client obtains the stripe SY, an identifier of the data strip SUYj, a logical address of the data of the data strip SUYj, and data strip status information may be stored at an extension address of the physical address assigned by the storage node Nj to the data strip SUYj.
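

The extension-address idea can be sketched as follows; the metadata record layout, the field widths, and the fixed record size are assumptions chosen only for illustration, since the embodiment does not pin down any on-media format.

```python
import struct

EXT_META_SIZE = 64  # assumed size of the per-strip metadata record

def ext_addr(phys_addr, valid_capacity):
    # The extension address mirrors the strip's physical address into the
    # region beyond the node's valid (externally visible) capacity, so
    # storing metadata there consumes none of the valid address space.
    return valid_capacity + phys_addr

def pack_strip_metadata(client_id, stripe_time_stamp, strip_id,
                        logical_addr, status_flags):
    # Hypothetical fixed layout: client identifier, time stamp at which the
    # client obtained the stripe, strip identifier, logical address of the
    # strip data, and data strip status information.
    record = struct.pack("<QQQQB", client_id, stripe_time_stamp, strip_id,
                         logical_addr, status_flags)
    return record.ljust(EXT_META_SIZE, b"\0")
```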


Further, the time stamp TPNj assigned by the storage node Nj to the data of the strip SUNj may also be stored at the extension address of the physical address assigned by the storage node Nj to the data strip SUNj. The time stamp TPKj assigned by the storage node Nj to the data of the strip SUKj may also be stored at the extension address of the physical address assigned by the storage node Nj to the data strip SUKj. The time stamp TPYj assigned by the storage node Nj to the data of the strip SUYj may also be stored at the extension address of the physical address assigned by the storage node Nj to the data strip SUYj.


With reference to various implementations of the embodiments of the present disclosure, an embodiment of the present disclosure provides an apparatus 11 for writing data, applied to a distributed block storage system in the embodiments of the present disclosure. As shown in FIG. 11, the apparatus 11 for writing data includes a receiving unit 111, a determining unit 112, an obtaining unit 113, a division unit 114, and a sending unit 115. The receiving unit 111 is configured to receive a first write request, where the first write request includes first data and a logical address. The determining unit 112 is configured to determine that the logical address is located in a partition P. The obtaining unit 113 is configured to obtain a stripe SN from R stripes, where N is an integer from 1 to R. The division unit 114 is configured to divide the first data into data of one or more strips SUNj in the stripe SN. The sending unit 115 is configured to send the data of the one or more strips SUNj to a storage node Nj. Further, the receiving unit 111 is further configured to receive a second write request, where the second write request includes second data and a logical address, and the logical address of the second data is the same as the logical address of the first data. The determining unit 112 is further configured to determine that the logical address is located in the partition P. The obtaining unit 113 is further configured to obtain a stripe SY from the R stripes, where Y is an integer from 1 to R, and N is different from Y. The division unit 114 is further configured to divide the second data into data of one or more strips SUYj in the stripe SY. The sending unit 115 is further configured to send the data of the one or more strips SUYj to the storage node Nj. For an implementation of the apparatus 11 for writing data in this embodiment of the present disclosure, refer to clients in the embodiments of the present disclosure, such as a first client and a second client. Further, the apparatus 11 for writing data may be a software module, and may be run on a client such that the client completes various implementations described in the embodiments of the present disclosure. Alternatively, the apparatus 11 for writing data may be a hardware device. For details, refer to the structure shown in FIG. 3. Units of the apparatus 11 for writing data may be implemented by the processor of the server described in FIG. 3. Therefore, for a detailed description about the apparatus 11 for writing data, refer to the descriptions of the clients in the embodiments of the present disclosure.
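

As a reading aid, the sketch below strings the five units of the apparatus 11 together as one write path; the injected helpers (partition_view, stripe_pool, send_rpc) are assumptions standing in for parts of the system described elsewhere, and check strips are omitted.

```python
class WriteApparatus:
    """Sketch of apparatus 11: receiving, determining, obtaining, division,
    and sending units combined into a single write path."""

    def __init__(self, partition_view, stripe_pool, send_rpc):
        self.partition_view = partition_view  # maps a logical address to its partition
        self.stripe_pool = stripe_pool        # hands out free stripes of a partition
        self.send_rpc = send_rpc              # callable: (node, strip_id, data) -> None

    def write(self, data, logical_addr):
        # Determining unit: the logical address locates the partition P.
        partition = self.partition_view.locate(logical_addr)
        # Obtaining unit: take a stripe SN from the partition's R stripes.
        stripe = self.stripe_pool.acquire(partition)
        # Division unit: split the data into per-strip chunks.
        chunks = [data[i:i + stripe.strip_size]
                  for i in range(0, len(data), stripe.strip_size)]
        # Sending unit: send each strip SUNj to its storage node Nj; a real
        # implementation would issue these sends concurrently.
        for strip_id, node, chunk in zip(stripe.strip_ids, stripe.nodes, chunks):
            self.send_rpc(node, strip_id, chunk)
```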


With reference to various implementations of the embodiments of the present disclosure, an embodiment of the present disclosure provides an apparatus 12 for storing data, applied to a distributed block storage system in the embodiments of the present disclosure. As shown in FIG. 12, the apparatus 12 for storing data includes a receiving unit 121 and a storage unit 122. The receiving unit 121 is configured to receive data of a strip SUNj in a stripe SN sent by a first client, where the data of the strip SUNj is obtained by dividing first data by the first client, the first data is carried in a first write request received by the first client, the first write request includes the first data and a logical address, and the logical address is used to determine that the first data is located in a partition P. The storage unit 122 is configured to store, based on a mapping between an identifier of the strip SUNj and a first physical address of the storage node Nj, the data of SUNj at the first physical address.


With reference to FIG. 12, the apparatus 12 for storing data further includes an assignment unit configured to assign a time stamp TPNj to the data of the strip SUNj.


Further, with reference to FIG. 12, the apparatus 12 for storing data further includes an establishment unit configured to establish a correspondence between a logical address of the data of the strip SUNj and the identifier of the strip SUNj.


Further, with reference to FIG. 12, the receiving unit 121 is further configured to receive data of a strip SUYj in a stripe SY sent by the first client, where the data of the strip SUYj is obtained by dividing second data by the first client, the second data is carried in a second write request received by the first client, the second write request includes the second data and a logical address, and the logical address is used to determine that the second data is located in the partition P. The storage unit 122 is further configured to store, based on a mapping between an identifier of the strip SUYj and a second physical address of the storage node Nj, the data of SUYj at the second physical address. Further, with reference to FIG. 12, the assignment unit is further configured to assign a time stamp TPYj to the data of the strip SUYj.


Further, with reference to FIG. 12, the establishment unit is further configured to establish a correspondence between a logical address of the data of the strip SUYj and an identifier of the strip SUYj. Further, the data of SUYj includes at least one of an identifier of the first client and a time stamp TPY at which the first client obtains the stripe SY.


Further, with reference to FIG. 12, the receiving unit 121 is further configured to receive data of a strip SUKj in a stripe SK sent by a second client, where the data of the strip SUKj is obtained by dividing third data by the second client, the third data is carried in a third write request received by the second client, the third write request includes the third data and a logical address, and the logical address is used to determine that the third data is located in the partition P. The storage unit 122 is further configured to store, based on a mapping between an identifier of the strip SUKj and a third physical address of the storage node Nj, the data of SUKj at the third physical address. Further, the assignment unit is further configured to assign a time stamp TPKj to the data of the strip SUKj. Further, the establishment unit is further configured to establish a correspondence between a logical address of the data of the strip SUKj and the identifier of the strip SUKj. Further, the apparatus 12 for storing data further includes a recovery unit configured to, after the storage node Nj becomes faulty, recover, on a new storage node, the data of the strip SUNj based on the stripe SN and the data of the strip SUKj based on the stripe SK. The apparatus 12 for storing data further includes an obtaining unit configured to obtain a time stamp TPNX of data of a strip SUNX in a storage node NX as a reference time stamp of the data of the strip SUNj, and obtain a time stamp TPKX of data of a strip SUKX in the storage node NX as a reference time stamp of the data of the strip SUKj. The apparatus 12 for storing data further includes an elimination unit configured to eliminate, from a buffer of the new storage node based on the time stamp TPNX and the time stamp TPKX, the strip data corresponding to an earlier time in the data of the strip SUNj and the data of SUKj, where X is any integer from 1 to M other than j.
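

The storing-side units can be condensed into a similar sketch; the in-memory structures and the use of a local monotonic clock for the assignment unit are assumptions, since the embodiment allows the time stamp to come from either the storage node or a time stamp server. The recovery, obtaining, and elimination units would then reuse the reference-time-stamp comparison sketched earlier.

```python
import time

class StoringApparatus:
    """Sketch of apparatus 12: receiving, storage, assignment, and
    establishment units on one storage node Nj."""

    def __init__(self):
        self.phys = {}     # strip identifier -> (physical address, data)
        self.logical = {}  # logical address -> strip identifier
        self.stamps = {}   # strip identifier -> locally assigned time stamp

    def receive_and_store(self, strip_id, phys_addr, data, logical_addr):
        self.phys[strip_id] = (phys_addr, data)      # storage unit
        self.stamps[strip_id] = time.monotonic_ns()  # assignment unit (e.g. TPNj)
        self.logical[logical_addr] = strip_id        # establishment unit
```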


For an implementation of the apparatus 12 for storing data in this embodiment of the present disclosure, refer to a storage node in the embodiments of the present disclosure, such as a storage node Nj. Further, the apparatus 12 for storing data may be a software module, and may be run on a server such that the storage node completes various implementations described in the embodiments of the present disclosure. Alternatively, the apparatus 12 for storing data may be a hardware device. For details, refer to the structure shown in FIG. 3. Units of the apparatus 12 for storing data may be implemented by the processor of the server described in FIG. 3. Therefore, for a detailed description about the apparatus 12 for storing data, refer to the description of the storage node in the embodiments of the present disclosure.


In this embodiment of the present disclosure, in addition to a stripe generated based on the EC algorithm described above, the stripe may be a stripe generated based on a multi-copy algorithm. When the stripe is generated based on the EC algorithm, the strips SUij in the stripe include data strips and a check strip. When the stripe is generated based on the multi-copy algorithm, all the strips SUij in the stripe are data strips, and the data strips carry the same data.
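

For contrast, here is a toy illustration of the two stripe types; the single XOR check strip stands in for a real erasure code, which this embodiment does not pin down.

```python
def make_multicopy_stripe(data, m):
    # Multi-copy stripe: all M strips are data strips carrying the same data.
    return [data] * m

def make_ec_stripe_xor(chunks):
    # EC-style stripe: the data strips are the input chunks and one check
    # strip is their bytewise XOR (any single lost strip can be rebuilt by
    # XOR-ing the survivors).
    assert chunks and all(len(c) == len(chunks[0]) for c in chunks)
    check = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            check[i] ^= byte
    return list(chunks) + [bytes(check)]
```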


Correspondingly, an embodiment of the present disclosure further provides a computer readable storage medium and a computer program product, where the computer readable storage medium and the computer program product include computer instructions used to implement the various solutions described in the embodiments of the present disclosure.


In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the unit division in the described apparatus embodiment is merely logical function division and may be other division in actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented using some interfaces. The indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.


The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions in the embodiments.


In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units are integrated into one unit.

Claims
  • 1. A method for storing data in a distributed block storage system comprising a partition (P), the P comprising M storage nodes and R stripes, each stripe comprising strips (SUij), j comprising every integer from 1 to M, i comprising every integer from 1 to R, and the method comprising: receiving, by a storage node (Nj), data of a strip (SUNj) in a stripe (SN) from a first client, the data of the SUNj being obtained by dividing first data by the first client, the first data being obtained by receiving a first write request by the first client, the first write request comprising the first data and a logical address, and the logical address determining whether the first data is located in the P; and storing, by the Nj based on a mapping between an identifier of the SUNj and a first physical address of the Nj, the data of the SUNj at the first physical address.
  • 2. The method of claim 1, further comprising assigning, by the Nj, a time stamp (TPNj) to the data of the SUNj.
  • 3. The method of claim 1, further comprising establishing, by the Nj, a correspondence between a logical address of the data of the SUNj and the identifier of the SUNj.
  • 4. The method of claim 1, wherein the data of the SUNj comprises at least one of an identifier of the first client or a time stamp (TPN) at which the first client obtains the SN.
  • 5. The method of claim 1, further comprising: receiving, by the Nj, data of a strip (SUYj) in another stripe (SY) from the first client, the data of the SUYj being obtained by dividing second data by the first client, the second data being obtained by receiving a second write request by the first client, the second write request comprising the second data and the logical address, and the logical address determining whether the second data is located in the P; and storing, by the Nj based on a mapping between an identifier of the SUYj and a second physical address of the Nj, the data of the SUYj at the second physical address.
  • 6. The method of claim 5, further comprising assigning, by the Nj, a time stamp (TPYj) to the data of the SUYj.
  • 7. The method of claim 6, further comprising establishing, by the Nj, a correspondence between a logical address of the data of the SUYj and the identifier of the SUYj.
  • 8. The method of claim 7, wherein the data of the SUYj comprises at least one of an identifier of the first client or a time stamp (TPY) at which the first client obtains the SY.
  • 9. The method of claim 1, further comprising: receiving, by the Nj, data of a strip (SUKj) in another stripe (SK) from a second client, the data of the SUKj being obtained by dividing third data by the second client, the third data being obtained by receiving a third write request by the second client, the third write request comprising the third data and the logical address, and the logical address determining whether the third data is located in the P; and storing, by the Nj based on a mapping between an identifier of the SUKj and a third physical address of the Nj, the data of the SUKj at the third physical address.
  • 10. The method of claim 1, wherein a strip (SUij) in another stripe (Si) is assigned by a stripe metadata server from the Nj based on a mapping between the P and the Nj comprised in the P.
  • 11. The method of claim 1, wherein each piece of data of one or more strips further comprises data strip status information, the data strip status information identifying whether each data strip of a stripe is empty.
  • 12. The method of claim 9, further comprising assigning, by the Nj, a time stamp (TPKj) to the data of the SUKj.
  • 13. The method of claim 12, further comprising: recovering, by a new storage node, the data of the SUNj based on the SN and the data of the SUKj based on the SK after the Nj becomes faulty; obtaining, by the new storage node, a time stamp (TPNX) of data of a strip (SUNX) in another storage node (NX) as a reference time stamp of the data of the SUNj and a time stamp (TPKX) of data of a strip (SUKX) in the NX as a reference time stamp of the data of the SUKj; and eliminating, by the new storage node, from a buffer based on the TPNX and the TPKX, strip data, corresponding to an earlier time, in the data of the SUNj and the data of the SUKj, X comprising any integer from 1 to M other than j.
  • 14. A storage node, applied to a distributed block storage system comprising a partition (P), the P comprising M storage nodes and R stripes, each stripe comprising strips (SUij), j comprising every integer from 1 to M, i comprising every integer from 1 to R, and the storage node comprising: an interface; and a processor coupled to the interface to communicate with the interface and configured to: receive data of a strip (SUNj) in a stripe (SN) from a first client, the data of the SUNj being obtained by dividing first data by the first client, the first data being obtained by receiving a first write request by the first client, the first write request comprising the first data and a logical address, and the logical address determining whether the first data is located in the P; and store, based on a mapping between an identifier of the SUNj and a first physical address of the Nj, the data of the SUNj at the first physical address.
  • 15. The storage node of claim 14, wherein the processor is further configured to assign a time stamp (TPNj) to the data of the SUNj.
  • 16. The storage node of claim 14, wherein the processor is further configured to: receive data of a strip (SUYj) in another stripe (SY) from the first client, the data of the SUYj being obtained by dividing second data by the first client, the second data being obtained by receiving a second write request by the first client, the second write request comprising the second data and the logical address, and the logical address determining whether the second data is located in the P; and store, based on a mapping between an identifier of the SUYj and a second physical address of the Nj, the data of the SUYj at the second physical address.
  • 17. The storage node of claim 14, wherein the processor is further configured to: receive data of a strip (SUKj) in another stripe (SK) from a second client, the data of the SUKj being obtained by dividing third data by the second client, the third data being obtained by receiving a third write request by the second client, the third write request comprising the third data and the logical address, and the logical address determining whether the third data is located in the P; and store, based on a mapping between an identifier of the SUKj and a third physical address of the Nj, the data of the SUKj at the third physical address.
  • 18. The storage node of claim 14, wherein a strip (SUij) in another stripe (Si) is assigned by a stripe metadata server from the Nj based on a mapping between the P and the Nj comprised in the P.
  • 19. The storage node of claim 14, wherein each piece of data of one or more strips further comprises data strip status information, the data strip status information identifying whether each data strip of a stripe is empty.
  • 20. A computer readable storage medium, comprising a computer instruction applied to a distributed block storage system comprising a partition (P), the P comprising M storage nodes and R stripes, each stripe comprising strips (SUij), j comprising every integer from 1 to M, i comprising every integer from 1 to R, and the computer readable storage medium further comprising a first computer instruction to enable a storage node (Nj) to perform the following operations of: receiving data of a strip (SUNj) in a stripe (SN) from a first client, the data of the SUNj being obtained by dividing first data by the first client, the first data being obtained by receiving a first write request by the first client, the first write request comprising the first data and a logical address, and the logical address determining whether the first data is located in the P; and storing, based on a mapping between an identifier of the SUNj and a first physical address of the Nj, the data of the SUNj at the first physical address.
  • 21. The computer readable storage medium of claim 20, further comprising a second computer instruction to enable the Nj to perform the following operations of: receiving data of a strip (SUKj) in another stripe (SK) from a second client, the data of the SUKj being obtained by dividing second data by the second client, the second data being obtained by receiving a second write request by the second client, the second write request comprising the second data and the logical address, and the logical address determining whether the second data is located in the P; and storing, based on a mapping between an identifier of the SUKj and a second physical address of the Nj, the data of the SUKj at the second physical address.
  • 22. The computer readable storage medium of claim 21, further comprising a third computer instruction to enable the Nj to perform the following operation of assigning a time stamp (TPKj) to the data of the SUKj.
  • 23. The computer readable storage medium of claim 22, wherein, in response to the Nj being faulty, a fourth computer instruction enables a new storage node to perform the following operations of: recovering the data of the SUNj based on the SN and the data of the SUKj based on the SK; obtaining a time stamp (TPNX) of data of a strip (SUNX) in another storage node (NX) as a reference time stamp of the data of the SUNj and a time stamp (TPKX) of data of a strip (SUKX) in the NX as a reference time stamp of the data of the SUKj; and eliminating, from a buffer of the new storage node based on the TPNX and the TPKX, strip data, corresponding to an earlier time, in the data of the SUNj and the data of the SUKj, X comprising any integer from 1 to M other than j.
CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of International Patent Application No. PCT/CN2017/106147 filed on Oct. 13, 2017, which is hereby incorporated by reference in its entirety.

Continuations (1)
Parent: PCT/CN2017/106147, Oct. 2017, US
Child: 16172264, US