DATA PROCESSING METHOD AND APPARATUS, ELECTRONIC DEVICE, AND STORAGE MEDIUM

Information

  • Patent Application
  • 20230161754
  • Publication Number
    20230161754
  • Date Filed
    January 10, 2023
  • Date Published
    May 25, 2023
  • CPC
    • G06F16/2365
    • G06F16/278
    • G06F16/215
  • International Classifications
    • G06F16/23
    • G06F16/27
    • G06F16/215
Abstract
A data processing method and apparatus, an electronic device, and a storage medium. The method comprises: determining a master data range on a current node (S102), wherein master data within the master data range corresponds to multiple pieces of copy data stored on other nodes; segmenting the master data range into multiple first sub-data ranges (S104); and performing data repair on each of the first sub-data ranges (S106), so as to repair inconsistent data between sub-data in the first sub-data ranges and corresponding copy sub-data in the copy data to make them consistent. The technical solutions of the present disclosure overcome the defect of resource waste caused by performing repeated repair on data with multiple copies stored on multiple nodes during a data repair process.
Description
TECHNICAL FIELD

The present disclosure relates to the field of computer technology, and, more particularly, to data processing methods and apparatuses, electronic devices, and storage media.


BACKGROUND

In order to ensure data reliability in a distributed system, a piece of data is usually stored on multiple nodes, and the data on the multiple nodes need to be kept consistent. However, data copies in some distributed systems may become inconsistent for various reasons. For example, when users of a Cassandra database write data to multiple data copies at ONE, TWO, QUORUM, or other consistency levels, some of the copies may be incomplete. Common distributed systems also have their own data repair functions, such as the hint & read-repair mechanism of the Cassandra database, but such a repair mechanism may cause a large system resource overhead and high operation and maintenance costs.


SUMMARY

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify all key features or essential features of the claimed subject matter, nor is it intended to be used alone as an aid in determining the scope of the claimed subject matter. The term “technique(s) or technical solution(s)” for instance, may refer to apparatus(s), system(s), method(s) and/or computer-readable instructions as permitted by the context above and throughout the present disclosure.


Embodiments of the present disclosure provide data processing methods and apparatuses, electronic devices, and computer-readable storage media.


An embodiment of the present disclosure provides a data processing method, comprising:


determining a master data range on a current node, wherein master data in the master data range corresponds to multiple copy data stored on other nodes;


segmenting the master data range into multiple first sub-data ranges; and


performing data repair on each of the first sub-data ranges respectively, so as to repair inconsistent data between sub-data in the first sub-data ranges and corresponding copy sub-data in the copy data to make them consistent.


Further, the performing data repair on each of the first sub-data ranges respectively, so as to repair inconsistent data between sub-data in the first sub-data ranges and corresponding copy sub-data in the copy data to make them consistent comprises:


generating first data repair tasks corresponding to each of the first sub-data ranges, wherein the repair tasks are used for repairing the inconsistent data between the sub-data in the first sub-data ranges and the corresponding copy sub-data in the copy data to make them consistent; and


assigning priorities to the first data repair tasks and then submitting them to a task queue, so that the first data repair tasks are executed from the task queue according to the priorities.


Further, the method also comprises:


assigning a repair identifier to repaired data in the first sub-data ranges; and


identifying the first sub-data ranges as being in a repair-completed state after all data in the first sub-data ranges are assigned the repair identifier.


Further, the method also comprises:


determining a second sub-data range extending from a first piece of data for which the repair starts to fail to a first piece of data for which the repair starts to succeed;


generating a second data repair task corresponding to the second sub-data range; and


assigning a priority to the second data repair task and then submitting it to the task queue.


Further, the method also comprises:


after the current node is recovered from a downtime, regenerating the first data repair tasks for the first sub-data ranges in a repair-uncompleted state, and submitting the regenerated first data repair tasks to the task queue.


Further, the performing data repair on each of the first sub-data ranges respectively, so as to repair inconsistent data between sub-data in the first sub-data ranges and corresponding copy sub-data in the copy data to make them consistent comprises:


determining whether the current first sub-data range belongs to the master data range on the current node; and


in response to determining that the current first sub-data range belongs to the master data range on the current node, performing the data repair on the current first sub-data range.


Further, the method also comprises:


after all the first sub-data ranges on the current node are in the repair-completed state, starting a next round of the data repair process.


Further, the method also comprises:


after each round of the data repair process starts, determining a repair period of the current node according to a data size of the master data range and a preset expiration time of deleted data; and


determining a data repair speed according to the repair period, so as to complete a round of repair of all data in the master data range according to the data repair speed within the preset expiration time.


An embodiment of the present disclosure provides a data storage system, comprising:


multiple nodes which comprise one or more storage devices and one or more processing devices, wherein


the storage device is configured to store master data and/or copy data, and the master data and the copy data corresponding to the same data are stored on the storage devices of different nodes; and


the processing device is configured to repair data on the storage device and during the data repair process, the processing device segments a master data range where the master data on the storage device is located into multiple first sub-data ranges, and performs the data repair on each of the first sub-data ranges respectively, so as to repair inconsistent data between sub-data in the first sub-data ranges and corresponding copy sub-data in the copy data to make them consistent.


Further, when performing the data repair on each of the first sub-data ranges, the processing device generates first data repair tasks corresponding to each of the first sub-data ranges;


the processing device also assigns priorities to the first data repair tasks and then submits them to a task queue; and


the processing device further executes the first data repair tasks from the task queue according to the priorities, so that the repair tasks repair the inconsistent data between the sub-data in the first sub-data ranges and the corresponding copy sub-data in the copy data to make them consistent.


Further, the processing device assigns a repair identifier to repaired data in the first sub-data ranges, and identifies the first sub-data ranges as being in a repair-completed state after all data in the first sub-data ranges are assigned the repair identifier.


Further, during the data repair process, the processing device determines a second sub-data range based on a first piece of data that starts to fail in the repair to a first piece of data that starts to succeed in the repair, generates a second data repair task corresponding to the second sub-data range, and assigns a priority to the second data repair task and then submits it to the task queue.


Further, after the node where the processing device is located is recovered from a downtime, the processing device regenerates the first data repair tasks for the first sub-data ranges in a repair-uncompleted state and submits the regenerated first data repair tasks to the task queue.


Further, when starting to perform the data repair on the first sub-data ranges, the processing device determines whether the current first sub-data range belongs to the master data range on a current node and when the current first sub-data range belongs to the master data range on the current node, performs the data repair on the current first sub-data range.


Further, after all the first sub-data ranges on the storage device are in the repair-completed state, the processing device starts a next round of the data repair process.


Further, after each round of the data repair process starts, the processing device determines a repair period according to a data size of the master data range and a preset expiration time of deleted data, and also determines a data repair speed according to the repair period; and the processing device completes a round of repair of all data in the master data range according to the data repair speed within the preset expiration time.


Further, the processing device acquires the sub-data in the first sub-data ranges from the storage device and acquires the copy sub-data corresponding to the sub-data in the first sub-data ranges from nodes where the copy data are located; and


the processing device performs a pairwise comparison between the sub-data and the copy sub-data and repairs the inconsistent data based on a result of the comparison.


An embodiment of the present disclosure provides a data processing apparatus, comprising:


a first determination module, configured to determine a master data range on a current node, wherein master data in the master data range corresponds to multiple copy data stored on other nodes;


a segmentation module, configured to segment the master data range into multiple first sub-data ranges; and


a repair module, configured to perform data repair on each of the first sub-data ranges respectively, so as to repair inconsistent data between sub-data in the first sub-data ranges and corresponding copy sub-data in the copy data to make them consistent.


The above-described functions may be implemented by hardware, or hardware executing corresponding software. The hardware or software comprises one or more modules corresponding to the above-described functions.


In an example design, the structure of the above-described apparatus comprises a memory and a processor, wherein the memory is configured to store one or more computer instructions that support the above-described apparatus to execute the above-described corresponding method, and the processor is configured to execute the computer instructions stored on the memory. The above-described apparatus may further comprise a communication interface for the above-described apparatus to communicate with other devices or a communication network.


An embodiment of the present disclosure provides an electronic device, comprising a memory and a processor, wherein the memory is configured to store one or more computer instructions, and the one or more computer instructions are executed by the processor to implement the method according to any one of the above-described aspects.


An embodiment of the present disclosure provides a computer-readable storage medium configured to store computer instructions used by any one of the above-described apparatuses, comprising relevant computer instructions for executing the method described in any one of the above-described aspects.


The technical solutions provided by the embodiments of the present disclosure may have at least the following beneficial effects:


during the repair process of a distributed system according to the embodiments of the present disclosure, each node automatically polls and repairs data in a master data range stored thereon and, after segmenting the master data range into first sub-data ranges with finer granularity, performs data repair on the first sub-data ranges. In this way, the present techniques not only overcome the defect of resource waste caused by repeated repair of data with multiple copies stored on multiple nodes during the data repair process in the conventional techniques, but also realize breakpoint resume through the finer-grained first sub-data ranges, and allow the data repair on a node to be spread over a long time frame by controlling the execution progress of each individual repair, thereby avoiding an instantaneous increase in resource consumption.


It should be understood that the foregoing general description and the following detailed description are for exemplary and explanatory purposes only, and are not intended to limit the present disclosure.





BRIEF DESCRIPTION OF DRAWINGS

The features, objectives, and advantages of the present disclosure will become more apparent from the following detailed description of non-limiting implementation manners in conjunction with the accompanying drawings.



FIG. 1 is a flowchart of a data processing method according to an implementation manner of the present disclosure;



FIG. 2 is a schematic diagram of a data repair process of first sub-data ranges according to an implementation manner of the present disclosure;



FIG. 3 is a structural block diagram of a data storage system according to an implementation manner of the present disclosure;



FIG. 4 is a schematic diagram of a data consistency repair architecture in a data storage system according to an implementation manner of the present disclosure;



FIG. 5 is a structural block diagram of a data processing apparatus according to an implementation manner of the present disclosure; and



FIG. 6 is a schematic structural diagram of an electronic device suitable for implementing a data processing method according to an implementation manner of the present disclosure.





DESCRIPTION OF EMBODIMENTS

Hereinafter, exemplary implementation manners of the present disclosure will be described in detail with reference to the accompanying drawings so that those skilled in the art can easily implement them. Also, for the sake of clarity, parts unrelated to describing the exemplary implementation manners are omitted from the drawings.


In the present disclosure, it should be understood that terms such as “comprising” or “having” are intended to indicate the existence of features, numbers, steps, actions, components, parts, or combinations thereof disclosed in this specification, and are not intended to exclude the possibility that one or more of other features, numbers, steps, actions, components, parts, or combinations thereof may exist or be added.


In addition, it should be noted that the embodiments of the present disclosure and the features of the embodiments may be combined with each other under the condition of no conflict. The present disclosure will be described in detail below with reference to the accompanying drawings and in conjunction with embodiments.


The details of the embodiments of the present disclosure will be described in detail below through specific embodiments.



FIG. 1 is a flowchart of a data processing method according to an implementation manner of the present disclosure. As shown in FIG. 1, the data processing method comprises the following steps:


Step S102: determine a master data range on a current node, wherein master data in the master data range corresponds to multiple copy data stored on other nodes;


Step S104: segment the master data range into multiple first sub-data ranges; and


Step S106: perform data repair on each of the first sub-data ranges respectively, so as to repair inconsistent data between sub-data in the first sub-data ranges and corresponding copy sub-data in the copy data to make them consistent.


For example, the Datastax cluster system uses the Cassandra database to store data, and each node in the cluster system repairs all the data it stores (including master data and copy data) at an appropriate time. Before each round of repair, the data to be repaired are segmented into many small data segments, and after the segmentation is completed, the data in the data segments are repaired according to a self-defined strategy. The repair process mainly reuses Cassandra's read-repair logic: each piece of data in the data segments triggers a read operation, and if the read data are found to be abnormal (for example, inconsistent with corresponding data on other nodes), they are repaired asynchronously. In the whole process, each node repairs data in all the ranges that the node is responsible for, including the master data in the master data range and the copy data in the copy data ranges stored as copies. Consequently, if a data table in a cluster is stored on three nodes, each of the three nodes will repair the data table, three times in total. Therefore, the above data repair solution adopted in the Datastax cluster system causes repeated calculations, IO operations, etc., which eventually reduce the repair speed.


For example, the data repair solution adopted in the Scylladb system sets up buffer pools of the same size on a master node and on a corresponding copy node. Each time, the master node and the copy node start reading data from the smallest data in the range and fill their buffer pools with the read data, then calculate the hash values corresponding to the two buffer pools, and compare the two hash values to determine a data range to be repaired. If the hash values are different, the to-be-repaired data range is determined from the data that fill the buffer pools according to the minimum set, and the master data and the copy data are then repaired in batches. However, this solution requires multiple hash calculations (including the calculation required when determining the data range and the calculation required when repairing the data) and consumes many resources. In addition, the granularity of each comparison in this solution is the size of the buffer pools: if only one piece of data in the buffer pools is different, the data of the entire buffer pool size will still be processed eventually, causing a lot of redundant calculations.


In this embodiment, a data cluster comprises multiple nodes, and each of the nodes stores data in a distributed system. The same block of data in the distributed system may have multiple copies stored respectively on the multiple nodes, for example, on 3 nodes: one of the nodes stores the master data of the block of data, and the other nodes store copy data of the block of data. In order to ensure that the data in the distributed system are not lost, it is necessary to ensure consistency between the master data and the copy data. To this end, in this embodiment of the present disclosure, each of the nodes automatically repairs the master data it is responsible for (if the node also stores copy data for other nodes, that copy data will be repaired by the other nodes rather than by the node itself). It should be noted that multiple different blocks of data can be stored on the same node, and these blocks can be master data or copy data; that is to say, both master data and copy data can be stored on the same node.


Therefore, in the data repair process according to this embodiment of the present disclosure, each of the nodes repairs the master data stored on the node, and the copy data stored on the node are repaired by the node storing the master data corresponding to the copy data. In this way, it can avoid the problem that multiple nodes repeatedly repair the same block of data, thereby saving system resources.


In this embodiment, after a round of repair starts, each of the nodes determines the master data range stored on the current node, that is, the range of the master data stored on the current node, for example, the range from the first record to the last record of the master data. Of course, it can be understood that when the master data of multiple blocks of data are stored on the current node, there may be multiple master data ranges. It should be noted that the master data in the master data range has multiple copy data stored on other nodes. The purpose of data repair is to repair the master data in the master data range and the multiple copy data stored on other nodes, so as to keep consistency between the master data and the copy data.


After the current node determines the master data range, the master data range can be segmented into multiple first sub-data ranges, and the size of the sub-data in the first sub-data ranges can be predefined, for example, the size of the sub-data in the first sub-data ranges is 200M by default. Of course, it can be modified to other sizes in advance if necessary, depending on the actual situation, which is not limited herein. For example, in the segmentation process, the segmentation may start from the smallest data record in the master data range, and every N pieces of data are segmented into a first sub-data range. When there are multiple blocks of master data on the current node, the above-described method is used for segmenting each block of master data.
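As an illustrative sketch only (the disclosure does not specify an implementation), the segmentation described above might look like the following, where the record keys, the `segment_master_range` function name, and segmenting every N records are hypothetical:

```python
def segment_master_range(keys, records_per_subrange):
    """Split the sorted record keys of a master data range into
    (start_key, end_key) first sub-data ranges of at most N records each,
    starting from the smallest record, as described above."""
    keys = sorted(keys)
    subranges = []
    for i in range(0, len(keys), records_per_subrange):
        chunk = keys[i:i + records_per_subrange]
        subranges.append((chunk[0], chunk[-1]))
    return subranges

# Example: a master range of 10 records segmented every 4 records.
print(segment_master_range(list(range(10)), 4))
# -> [(0, 3), (4, 7), (8, 9)]
```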


After multiple first sub-data ranges are obtained by segmentation, data repair may be performed on each of the first sub-data ranges. During the repair of the first sub-data ranges, for example, starting from the first one of the first sub-data ranges, the data in the current first sub-data range of the master data and the data corresponding to the current sub-data range in the copy data stored on other nodes are read and compared. If they are inconsistent, it can be determined whether the data in the current first sub-data range of the master data are incorrect or the corresponding data in the copy data are incorrect. For example, if the current node stores the master data and two other nodes store the copy data respectively, three pieces of data in the current first sub-data range can be read from the current node and the other two nodes, and by pairwise consistency comparison it can be determined which node has wrong data, so that the wrong data can be repaired. In some embodiments, during the comparison of data consistency, one key record may be used as the granularity for comparison; this repair method will not cause batches of data to be mis-repaired when only individual keys are different.
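The three-replica comparison above can be sketched as a per-key majority vote; note that the disclosure only describes pairwise consistency comparison, so resolving the correct value by majority, and the `repair_key` name, are assumptions for illustration:

```python
from collections import Counter

def repair_key(replicas):
    """Given the values read for one key from the master node and the
    copy nodes, pick the majority value and report which replica
    indices hold wrong data and need repair (assumption: with three
    replicas, the value agreed on by two of them is taken as correct)."""
    majority, _ = Counter(replicas).most_common(1)[0]
    to_repair = [i for i, v in enumerate(replicas) if v != majority]
    return majority, to_repair

# Replica 1 disagrees with the other two, so only it is repaired.
print(repair_key(["v2", "v1", "v2"]))  # -> ('v2', [1])
```

Because the comparison granularity is a single key record, a disagreement on one key never triggers repair of any other key.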


After the data in all the first sub-data ranges on the current node are repaired, the current node can start a next round of polling and repair and repeat the above-described repair process. In this embodiment of the present disclosure, when a round of repair starts, the time required for the round can be calculated according to the data size in the master data range on the current node, and the flow-control speed can be adjusted to ensure that the round of repair is completed within a preset time frame. The preset time frame is related to a storage policy of the distributed file system. For example, when Cassandra deletes a piece of data, it performs an insert operation; the newly inserted piece of data is called a tombstone. The biggest difference between a tombstone and a normal record is that the tombstone has an expiration time: when the expiration time is reached, the tombstone data will actually be deleted from the disk the next time Cassandra performs a compaction operation. Therefore, in this embodiment of the present disclosure, when Cassandra is used for storing data, the preset time frame can be set to the tombstone's expiration time (10 days by default).
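The flow-control calculation above reduces to dividing the master data size by the time budget. A minimal sketch, in which the `repair_speed` name and the safety factor (finishing with time to spare) are assumptions not stated in the disclosure:

```python
def repair_speed(master_range_bytes, expiration_seconds, safety_factor=0.8):
    """Minimum repair throughput (bytes/second) so that one full round
    over the master data range finishes within the preset expiration
    time, e.g. the tombstone expiration time (10 days by default).
    The safety factor reserves headroom and is an assumption."""
    return master_range_bytes / (expiration_seconds * safety_factor)

# Example with round numbers: ~69.12 MB repaired within 864,000 s
# (10 days) at 80% of the budget requires 100 bytes/second.
print(repair_speed(69_120_000, 864_000))  # -> 100.0
```

Spreading the repair evenly over such a long window is what avoids the instantaneous resource spikes mentioned earlier.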


During the repair process of a distributed system according to the embodiments of the present disclosure, each node automatically polls and repairs data in a master data range stored thereon and, after segmenting the master data range into first sub-data ranges with finer granularity, performs data repair on the first sub-data ranges. In this way, it not only overcomes the defect of resource waste caused by repeated repair of data with multiple copies stored on multiple nodes during the data repair process in the conventional techniques, but also realizes breakpoint resume through the finer-grained first sub-data ranges, and allows the data repair on a node to be spread over a long time frame by controlling the execution progress of each individual repair, thereby avoiding an instantaneous increase in resource consumption.


In an example implementation manner of this embodiment, the Step S106, namely the step of performing data repair on each of the first sub-data ranges respectively, so as to repair inconsistent data between sub-data in the first sub-data ranges and corresponding copy sub-data in the copy data to make them consistent further comprises the following steps:


generating first data repair tasks corresponding to each of the first sub-data ranges, wherein the repair tasks are used for repairing the inconsistent data between the sub-data in the first sub-data ranges and the corresponding copy sub-data in the copy data to make them consistent; and


assigning priorities to the first data repair tasks and then submitting them to a task queue, so that the first data repair tasks are executed from the task queue according to the priorities.


In this example implementation manner, a first data repair task may be started for each of the first sub-data ranges, a priority may also be set for each of the first sub-data ranges according to a preset factor, and the first data repair tasks may be invoked and executed according to the set priority, for example, the first data repair task with a higher priority may be invoked and executed first. In some embodiments, the priority assigned to a first data repair task corresponding to a first sub-data range may indicate the urgency of data repair of the first sub-data range, and the first sub-data ranges requiring urgent repair may be assigned a higher priority, while those that do not require urgent repair may be assigned a lower priority. For example, a first sub-data range at the front of the master data range may be assigned a higher priority, while a first sub-data range at the rear may be assigned a lower priority. In this way, more urgent first sub-data ranges can be repaired before the other first sub-data ranges according to the degree of urgency.
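The priority queue behavior described above can be sketched with a binary heap; the `RepairTaskQueue` class name and the convention that a lower number means higher priority are illustrative assumptions:

```python
import heapq

class RepairTaskQueue:
    """Task queue sketch: tasks are submitted with a priority and
    executed highest-priority first (lower number = more urgent, e.g.
    sub-ranges at the front of the master data range)."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker keeps submission order for equal priorities

    def submit(self, priority, task):
        heapq.heappush(self._heap, (priority, self._seq, task))
        self._seq += 1

    def pop(self):
        """Return the most urgent pending repair task."""
        return heapq.heappop(self._heap)[2]

q = RepairTaskQueue()
q.submit(2, "repair sub-range B")
q.submit(1, "repair sub-range A")  # more urgent, executed first
print(q.pop())  # -> repair sub-range A
```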


In an example implementation manner of this embodiment, the method further comprises the following steps:


assigning a repair identifier to repaired data in the first sub-data ranges; and


identifying the first sub-data ranges as being in a repair-completed state after all data in the first sub-data ranges are assigned the repair identifier.


In this example implementation manner, the first data repair tasks may perform comparative repair at a fine granularity on the data in the first sub-data ranges. For example, when one key record is used as the granularity, the repair can be performed by comparing whether the data record corresponding to the current key in the first sub-data ranges is consistent with the record corresponding to the key in the copy data. When the data corresponding to the current key are consistent with the corresponding data in the copy data, the data corresponding to the key can be assigned the repair identifier, indicating that the data have been repaired; when they are inconsistent, the inconsistent data between the current key and the copy data can be repaired, and after the repair is completed, the data corresponding to the current key are assigned the repair identifier. In this way, after all the data in the first sub-data ranges are assigned the repair identifier, it can be determined that the data in the first sub-data ranges have been repaired, so the first sub-data ranges can be identified as being in a repair-completed state; otherwise the first sub-data ranges are identified as being in a repair-uncompleted state.
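The per-key repair identifier and the derived sub-range state can be sketched as follows; the `SubRangeState` class and a boolean flag per key are assumptions standing in for whatever identifier the system actually records:

```python
class SubRangeState:
    """Track the repair identifier for each key in one first sub-data
    range; the range reaches the repair-completed state only once every
    key carries the identifier."""

    def __init__(self, keys):
        self.repaired = {k: False for k in keys}

    def mark_repaired(self, key):
        """Assign the repair identifier after the key is verified
        consistent (or repaired to consistency) against the copies."""
        self.repaired[key] = True

    def completed(self):
        return all(self.repaired.values())

state = SubRangeState(["k1", "k2"])
state.mark_repaired("k1")
print(state.completed())  # -> False: k2 still lacks the identifier
state.mark_repaired("k2")
print(state.completed())  # -> True: repair-completed state
```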


In an example implementation manner of this embodiment, the method further comprises the following steps:


with respect to the corresponding data in the copy data, determining a second sub-data range extending from a first piece of data for which the repair starts to fail to a first piece of data for which the repair starts to succeed;


generating a second data repair task corresponding to the second sub-data range; and


assigning a priority to the second data repair task and then submitting it to the task queue.


In this example implementation manner, during the data repair process, if the node where the copy data are located goes down, data repair will start to fail from the moment of the downtime. Repeated repair may then be attempted on each piece of data, for example 3 rounds of repair; if all the rounds of repair fail, the piece of data is recorded as failing in the repair, and repair of subsequent data continues. After the node where the copy data are located is recovered, the subsequent data can be repaired successfully, so a second sub-data range can be determined from a first piece of data that starts to fail in the repair to a first piece of data that starts to succeed in the repair; all the data in the second sub-data range failed to be repaired due to the downtime of the node where the copy data are located. Therefore, a second data repair task can be generated for the data in the second sub-data range, and the second data repair task can be assigned a priority and then submitted to the task queue, so that the second data repair task can be invoked from the task queue to re-repair the data in the second sub-data range. The second sub-data range is a sub-data range included in the first sub-data ranges. After the second data repair task corresponding to the second sub-data range is completed and all the data in the second sub-data range are successfully repaired, the first sub-data range is considered to be in a repair-completed state.
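Locating the second sub-data range is a scan over the ordered per-key repair results for the first failure and the first subsequent success. A sketch, in which the `second_subrange` name, the (key, succeeded) pair representation, and treating both boundary keys as inclusive are illustrative assumptions:

```python
def second_subrange(results):
    """results: ordered (key, succeeded) pairs for one first sub-data
    range. Return (start_key, end_key) spanning from the first key whose
    repair failed (copy node down) to the first key that succeeds again
    (copy node recovered), or None if nothing failed. If no later key
    succeeds, the span extends to the last key."""
    start = None
    for key, ok in results:
        if not ok and start is None:
            start = key            # first piece of data that starts to fail
        elif ok and start is not None:
            return (start, key)    # first piece of data that starts to succeed
    return (start, results[-1][0]) if start is not None else None

# k2 and k3 failed while the copy node was down; k4 succeeded after recovery.
print(second_subrange([("k1", True), ("k2", False), ("k3", False), ("k4", True)]))
# -> ('k2', 'k4')
```

A second data repair task covering this span can then be prioritized and resubmitted to the task queue.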


In an example implementation manner of this embodiment, the method further comprises the following steps:


after the current node is recovered from a downtime, regenerating the first data repair tasks for the first sub-data ranges in a repair-uncompleted state, and submitting the regenerated first data repair tasks to the task queue.


In this example implementation manner, after the current node storing the master data goes down and gets recovered, the first sub-data ranges currently in the repair-completed state and the first sub-data ranges currently in the repair-uncompleted state can be obtained by querying the information recorded by the system. For the first sub-data ranges currently in the repair-uncompleted state, corresponding first data repair tasks can be regenerated, and the regenerated first data repair tasks can be assigned a priority and then submitted to the task queue to continue the repair of the data in the first sub-data ranges. In this implementation manner, the function of breakpoint resume can be realized, and the granularity of breakpoint resume is one sub-data range. In some embodiments, the size of one sub-data range may be set as 200M, so the granularity of breakpoint resume in this way is relatively fine, and the data repair efficiency can be improved.
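A minimal sketch of the breakpoint-resume step, assuming the system's sub-range log is queryable as a mapping from sub-range identifier to state (the `resume_after_downtime` name and the state strings are hypothetical):

```python
def resume_after_downtime(subrange_log):
    """After the current node recovers from a downtime, regenerate first
    data repair tasks only for sub-ranges still in the repair-uncompleted
    state; completed sub-ranges are skipped, giving sub-range-granularity
    breakpoint resume."""
    return [("repair", range_id)
            for range_id, state in sorted(subrange_log.items())
            if state != "completed"]

log = {"range-1": "completed", "range-2": "uncompleted", "range-3": "uncompleted"}
print(resume_after_downtime(log))
# -> [('repair', 'range-2'), ('repair', 'range-3')]
```

The regenerated tasks would then be assigned priorities and resubmitted to the task queue as described above.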



FIG. 2 is a schematic diagram of a data repair process of first sub-data ranges according to an implementation manner of the present disclosure. As shown in FIG. 2, after a first data repair task corresponding to each of the first sub-data ranges 202 is started, the repair task is submitted to the task queue 204 and then invoked by the execution engine according to the priority, and the repair state 206 of the current sub-data range is obtained from the sub-range log sheet 208 of the system at each invocation. If the state is repair-completed, the repair task is not executed; if the state is repair-uncompleted, repair of the sub-data and the copy sub-data corresponding to the first sub-data range is started. If the repair succeeds, the repair state 206 in the sub-range log sheet 208 is marked as repair-completed; if the repair fails, the repair state 206 in the sub-range log sheet 208 is marked as repair-uncompleted, and meanwhile a new repair task is started and submitted to the task queue 204 for continued execution next time.
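The FIG. 2 flow can be illustrated with a minimal sketch, assuming a priority task queue, a sub-range log sheet mapping each sub-range to its repair state, and a repair routine that may fail. All names and the state strings are illustrative, not part of the disclosed system.

```python
import heapq

# Illustrative state strings standing in for the sub-range log sheet states.
COMPLETED, UNCOMPLETED = "repair-completed", "repair-uncompleted"

def run_queue(task_queue, log_sheet, try_repair):
    """Drain a heap of (priority, range_id) tasks, as in FIG. 2:
    skip completed sub-ranges, mark state after each attempt, and
    resubmit a new task to the queue when a repair fails."""
    while task_queue:
        priority, range_id = heapq.heappop(task_queue)
        if log_sheet.get(range_id) == COMPLETED:
            continue                      # already repaired: do not execute
        if try_repair(range_id):
            log_sheet[range_id] = COMPLETED
        else:
            log_sheet[range_id] = UNCOMPLETED
            # start a new repair task for continued execution next time
            heapq.heappush(task_queue, (priority, range_id))
```

In this sketch a lower priority number is invoked first; a repair routine that fails permanently would be retried indefinitely, so a bound on retries (as with the 3 rounds described above) would be added in practice.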


In an example implementation manner of this embodiment, the Step S106, namely the step of performing data repair on each of the first sub-data ranges respectively, so as to repair inconsistent data between sub-data in the first sub-data ranges and corresponding copy sub-data in the copy data to make them consistent further comprises the following steps:


determining whether the current first sub-data range belongs to the master data range on the current node; and


when the current first sub-data range belongs to the master data range on the current node, performing the data repair on the current first sub-data range.


In this example implementation manner, before repairing the data in each of the first sub-data ranges, it may be determined whether the current first sub-data range still belongs to the master data range on the current node. This is because, after a new node is added to the cluster, the master data range to be repaired by a node in the original cluster may overlap with the master data range to be repaired by the new node. For example, if the node A is originally responsible for the master data range 1-5 and the newly added node B takes over the master data range 3-5, the node A will still repair the master data range 1-5 during this round of repair, while the node B will repair the master data range 3-5 once it is added and its state changes to normal, so the data range 3-5 may be repaired twice. This does not affect correctness, but it causes repeated data repair. To solve this problem, the embodiment of the present disclosure uses one first sub-data range as the granularity: when starting a data repair process of a new first sub-data range in each round, it only needs to first determine whether the first sub-data range still belongs to the master data range of the current node. In this way, the problem of resource waste caused by repeated data repair after a new node is added is solved.
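The ownership check described above reduces to a containment test. The following sketch is illustrative only, with sub-data ranges modelled as hypothetical (low, high) key pairs.

```python
# Illustrative sketch: a first sub-data range is repaired only if it still
# falls entirely inside the master data range the current node owns.
def belongs_to_master_range(sub_range, master_range):
    sub_lo, sub_hi = sub_range
    lo, hi = master_range
    return lo <= sub_lo and sub_hi <= hi

# Node A originally owned 1-5; after node B takes over 3-5, node A's master
# data range shrinks to 1-3 and the sub-range 3-5 is skipped by node A.
```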


In an example implementation manner of this embodiment, the method further comprises the following steps:


after all the first sub-data ranges on the current node are in the repair-completed state, starting a next round of the data repair process.


In this example implementation manner, the data repair process of each node may be a cyclic process, and a next round of the data repair process is started after one round ends. The condition marking the end of each round of the data repair process is that all the sub-data ranges in the master data range on the current node are in the repair-completed state. In this way, each node in the cluster can automatically poll and repair the master data in the master data range stored thereon together with the corresponding copy data stored on other nodes, and eventually, all data on all nodes in the cluster can be continuously repaired so as to always keep consistency between the master data and the copy data in the cluster.


In an example implementation manner of this embodiment, the method further comprises the following steps:


after each round of the data repair process starts, determining a repair period of the current node according to a data size of the master data range and a preset expiration time of deleted data; and determining a data repair speed according to the repair period, so as to complete a round of repair of all data in the master data range according to the data repair speed within the preset expiration time.


In this example implementation manner, the preset expiration time of deleted data is the time that the deleted data are retained in a distributed system, which may vary with an adopted distributed system, for example, the preset expiration time in the Cassandra system may be gc_grace_seconds. When the Cassandra system deletes a piece of data, it will perform an insert operation. The newly inserted piece of data is called a tombstone. The biggest difference between a tombstone and a normal record is that the tombstone has an expiration time gc_grace_seconds. When the expiration time is reached, the tombstone data will be completely deleted. Therefore, the time that the data deleted at the application level are retained before they are completely deleted by the system is the preset expiration time of the deleted data. If a round of data repair is completed before the preset expiration time is reached, it will not eventually cause inconsistency between the master data and the copy data.


In this embodiment of the present disclosure, a data repair speed is determined according to the data size in the master data range and the preset expiration time, so that the data in the master data range on the current node are repaired according to the data repair speed to control the time for completing a round of repair process within the preset expiration time. In some embodiments, the data repair speed may be a flow control speed. The data size repaired every day can be controlled through the flow control speed. After the data size repaired every day is reached, the current repair process can be paused and then continued the next day. For example, if the data size in the master data range on a single node is N megabytes, and the preset expiration time is M days, the data size that can be repaired per day is N/M megabytes. The flow control speed can be deemed as N/M megabytes/day. In this way, the embodiment of the present disclosure balances a huge amount of repair tasks (for example, comparison of hash calculations, filling of data gaps between the master data and the copy data, etc.), and performs repair in the allowable time frame of the distributed system, reducing the impact of data repair on customers.
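The flow-control arithmetic above can be stated as a one-line sketch. The function name is hypothetical and the figures are the ones used in the example.

```python
# Illustrative sketch of the flow-control speed: repair N/M megabytes per
# day so that one full round finishes within the preset expiration time of
# deleted data (e.g. Cassandra's gc_grace_seconds window).
def daily_repair_quota(master_range_megabytes, expiration_days):
    return master_range_megabytes / expiration_days
```

For example, 2000 MB of master data on a single node with a 10-day expiration window yields a flow control speed of 200 MB/day; once that quota is reached, the repair process pauses and continues the next day.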


In an example implementation manner of this embodiment, the Step S106, namely the step of performing data repair on each of the first sub-data ranges respectively, so as to repair inconsistent data between sub-data in the first sub-data ranges and corresponding copy sub-data in the copy data to make them consistent further comprises the following steps:


acquiring the sub-data in the first sub-data ranges from the current node, and acquiring the copy sub-data corresponding to the sub-data in the first sub-data ranges from other nodes where the copy data are located;


performing a pairwise comparison between the sub-data and the copy sub-data; and


repairing the inconsistent data based on a result of the comparison.


In this example implementation manner, when the sub-data in the first sub-data ranges are being repaired, the sub-data in the first sub-data ranges and corresponding copy sub-data on other nodes are read into a storage area, and then a pairwise comparison is performed, for example, when the sub-data 1 have two copy sub-data, namely 2 and 3, the process of the pairwise comparison and repair comprises:


1. comparing the sub-data 1 and the copy sub-data 2 to fill the inconsistent data, for example, if the record corresponding to the key currently read from the sub-data 1 is {1, 2, 3}, and the record corresponding to the key read from the copy sub-data 2 is {1, 2}, then the key {3} can be recorded as the missing data of the copy sub-data 2;


2. comparing the copy sub-data 2 and the copy sub-data 3 to fill the inconsistent data, for example, if the record corresponding to the key currently read from the copy sub-data 2 is {1, 2}, and the record corresponding to the key read from the copy sub-data 3 is {1}, then the key {2} can be recorded as the missing data of the copy sub-data 3;


3. comparing the sub-data 1 and the copy sub-data 3 to fill the inconsistent data, for example, if the record corresponding to the key currently read from the sub-data 1 is {1, 2, 3}, and the record corresponding to the key read from the copy sub-data 3 is {1}, then the key {2, 3} can be recorded as the missing data of the copy sub-data 3; and


finally, it can be determined that the key record in the copy sub-data 2 lacks {3}, and the key record in the copy sub-data 3 lacks {2, 3}, so the key record of {3} can be pushed to the node where the copy sub-data 2 are located to cause the node to repair the key record in the copy sub-data 2 to be {1, 2, 3}, and the key record of {2, 3} can be pushed to the node where the copy sub-data 3 are located to cause the node to repair the key record in the copy sub-data 3 to be {1, 2, 3}.
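The pairwise comparison and gap-filling steps above can be sketched as follows, modelling each replica's key records as a set. The replica names and key values mirror the example and are illustrative; in the disclosed method the missing keys would be pushed to the nodes where the copies are located, whereas this sketch simply applies them locally.

```python
# Illustrative sketch of the pairwise comparison: accumulate the keys each
# replica is missing relative to every other replica, then fill the gaps.
def pairwise_repair(replicas):
    """replicas: dict of name -> set of keys. Returns the missing keys
    recorded per replica; each replica's set is filled in place."""
    missing = {name: set() for name in replicas}
    names = list(replicas)
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            a, b = names[i], names[j]
            missing[b] |= replicas[a] - replicas[b]  # keys b lacks vs. a
            missing[a] |= replicas[b] - replicas[a]  # keys a lacks vs. b
    for name, keys in missing.items():
        replicas[name] |= keys   # fill each replica's data gaps
    return missing
```

With the example above (`sub-data 1` = {1, 2, 3}, `copy sub-data 2` = {1, 2}, `copy sub-data 3` = {1}), the recorded missing keys are {3} for copy sub-data 2 and {2, 3} for copy sub-data 3, and all three end up as {1, 2, 3}.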



FIG. 3 is a structural block diagram of a data storage system according to an implementation manner of the present disclosure. As shown in FIG. 3, the data storage system comprises multiple nodes 302, 304, . . . , 30N, each of which comprises one or more storage devices and processing devices, wherein N may be any integer. In the example of FIG. 3, the node 302 includes the storage device 3022 and the processing device 3024, the node 304 includes the storage device 3042 and the processing device 3044, and the node 30N includes the storage device 30N2 and the processing device 30N4.


The storage devices 3022, 3042, . . . , 30N2 are configured to store master data and/or copy data, and the master data and the copy data corresponding to the same data are stored on storage devices of different nodes.


The processing devices 3024, 3044, . . . , 30N4 are configured to repair data on the storage devices. During the data repair process, the processing devices 3024-30N4 segment a master data range where the master data on the storage devices 3022-30N2 are located into multiple first sub-data ranges, and perform the data repair on each of the first sub-data ranges respectively, so as to repair inconsistent data between sub-data in the first sub-data ranges and corresponding copy sub-data in the copy data to make them consistent.


In this embodiment, the data storage system may be a distributed system, the multiple nodes 302-30N may form a cluster, and each of the nodes 302-30N comprises at least the storage device and the processing device. Each of the nodes 302-30N may poll and repair the master data stored on the storage device and the copy data stored on other nodes corresponding to the master data. After completing one round of repair, each of the nodes automatically starts a next round of repair, and through continuous polling, the master data and the copy data in the data storage system can always be kept consistent.


The processing device is configured to execute the repair process of the master data stored on the storage device of the node. The repair details can be found in the descriptions of FIG. 1 and related embodiments above, which are not elaborated herein.


In the data storage system of this embodiment, each of the nodes automatically polls and repairs data in a master data range stored thereon and, after segmenting the master data range into first sub-data ranges with finer granularity, performs data repair on the first sub-data ranges. In this way, it not only overcomes the defect of resource waste caused by performing repeated repair on data with multiple copies stored on multiple nodes during the data repair process in a storage system in the conventional techniques, but also can realize breakpoint resume by segmenting the master data range into first sub-data ranges with finer granularity and enable data repair on a node to be controlled in a long time frame by controlling the execution progress of a single repair, avoiding an instantaneous increase in resource consumption.


In an example implementation manner of this embodiment, when performing the data repair on each of the first sub-data ranges, the processing device generates first data repair tasks corresponding to each of the first sub-data ranges;


the processing device also assigns priorities to the first data repair tasks and then submits them to a task queue; and


the processing device further executes the first data repair tasks from the task queue according to the priority, so that the repair tasks repair the inconsistent data between the sub-data in the first sub-data ranges and the corresponding copy sub-data in the copy data to make them consistent.


In this example implementation manner, the processing device starts a first data repair task corresponding to each of the first sub-data ranges, and assigns a priority to the first data repair task and then submits it to the task queue; and the processing device also invokes and executes each of the first data repair tasks from the task queue according to the priority by starting a task scheduling process. For the details, reference may be made to the description of the data processing method above, which will not be elaborated herein.


In an example implementation manner of this embodiment, the processing device assigns a repair identifier to repaired data in the first sub-data ranges, and identifies the first sub-data ranges as being in a repair-completed state after all data in the first sub-data ranges are assigned the repair identifier.


In an example implementation manner of this embodiment, during the data repair process, the processing device determines a second sub-data range based on a first piece of data that starts to fail in the repair to a first piece of data that starts to succeed in the repair, generates a second data repair task corresponding to the second sub-data range, and assigns a priority to the second data repair task and then submits it to the task queue.


In an example implementation manner of this embodiment, after the node where the processing device is located is recovered from a downtime, the processing device regenerates the first data repair tasks for the first sub-data ranges in a repair-uncompleted state and submits the regenerated first data repair tasks to the task queue.


In an example implementation manner of this embodiment, when starting to perform the data repair on the first sub-data ranges, the processing device determines whether the current first sub-data ranges belong to the master data range on a current node and when the current first sub-data ranges belong to the master data range on the current node, performs the data repair on the current first sub-data ranges.


In an example implementation manner of this embodiment, after all the first sub-data ranges on the storage device are in the repair-completed state, the processing device starts a next round of the data repair process.


In an example implementation manner of this embodiment, after each round of the data repair process starts, the processing device determines a repair period according to a data size of the master data range and a preset expiration time of deleted data, and also determines a data repair speed according to the repair period; and the processing device completes a round of repair of all data in the master data range according to the data repair speed within the preset expiration time.


In an example implementation manner of this embodiment, the processing device acquires the sub-data in the first sub-data ranges from the storage device and acquires the copy sub-data corresponding to the sub-data in the first sub-data ranges from nodes where the copy data are located; and


the processing device performs a pairwise comparison between the sub-data and the copy sub-data and repairs the inconsistent data based on a result of the comparison.


For the details of the above-described example implementation manner, reference may be made to the corresponding description of the above-described data processing method, which will not be elaborated herein.



FIG. 4 is a schematic diagram of a data consistency repair architecture in a data storage system according to an implementation manner of the present disclosure. As shown in FIG. 4, the data storage system 400 comprises multiple nodes, wherein each of the nodes stores data in a distributed system, and the data may be master data or copy data. It is assumed that a node A 402 stores master data 404 and copy data 406, and that no correspondence exists between the master data 404 and the copy data 406. The copy data corresponding to the master data 404 are stored on nodes B 408 and C 410, which respectively store copy data 412 and copy data 414 corresponding to the master data 404. Master data 416 corresponding to the copy data 406 are stored on a node other than the node A, for example, on a node D 418. It can be understood that, in addition to the above-mentioned master data 404, copy data 406, master data 416, copy data 412, and copy data 414, other data, either master data or copy data, can also be stored on the nodes A 402, B 408, C 410, and D 418. This example is only intended to illustrate the data repair process of the embodiment of the present disclosure, and the actual situation is not limited thereto.


After the node A 402 starts a round of the data repair process, it segments the master data range where the master data 404 are located into multiple first sub-data ranges 1-n, where n may be any integer, and starts n first data repair tasks 1-n for the multiple first sub-data ranges 1-n to respectively repair the sub-data in the multiple first sub-data ranges 1-n. First, the n first data repair tasks 1-n are assigned priorities, for example, in a descending order according to the sub-data storage addresses in an ascending order; that is, the first data repair tasks corresponding to the sub-data stored at the front have a higher priority than those corresponding to the sub-data stored at the rear. Therefore, the priority order is: the first data repair task 1 > the first data repair task 2 > . . . > the first data repair task n. After the first data repair tasks 1-n are submitted to the task queue 420, the execution engine may execute them in a descending order of priority. Taking the execution process of the first data repair task 1 as an example, the sub-data 1 in the first sub-data range 1 corresponding to the first data repair task 1 are acquired from the node A 402, and the copy sub-data 2 in the copy data 412 and the copy sub-data 3 in the copy data 414 corresponding to the sub-data 1 are acquired from the nodes B 408 and C 410 respectively. Then, a pairwise comparison is performed on the sub-data 1, the copy sub-data 2, and the copy sub-data 3. If the sub-data 1 and the copy sub-data 2 are inconsistent, and the data in the sub-data 1 are less than those in the copy sub-data 2, the data in the copy sub-data 2 are used to repair the sub-data 1, that is, to fill the missing data in the sub-data 1.
If the copy sub-data 3 and the sub-data 1 are inconsistent, and the data in the copy sub-data 3 are less than those in the sub-data 1, the missing data in the copy sub-data 3 are sent to the node C 410 to request the node C 410 to fill them in. After the first data repair task 1 is completed and the corresponding sub-data are all successfully repaired, the state of the first sub-data range 1 corresponding to the first data repair task 1 is identified as being in the repair-completed state. In the above-described manner, all the first data repair tasks 1-n in the task queue are completed in sequence.


After all the first sub-data ranges 1-n corresponding to the master data 404 on the node A 402 are identified as being in the repair-completed state, it can be considered that the master data 404 are successfully repaired, and next master data (if any) can be successively repaired. If all the master data on the node A are repaired, the node A can start a next round of the data repair process, and repeat the above-described steps.


After the node D 418 starts a round of the data repair process, the master data 416 are repaired through the same process as above. Assuming that first sub-data ranges 1-m, where m may be any integer, are obtained after segmentation, first data repair tasks 1-m are started correspondingly. The copy data 406 on the node A 402, and other copy sub-data 422 from other copy data of the master data 416 on the node N 424, are also repaired in the repair process according to the task queue 426.


The foregoing description is for illustration only, and the data repair in the distributed system 400 is not limited to the content listed in the above-described process. All nodes can perform data repair in the above-described manner, and each of the master data on each node can be repaired by using the above-described process.


The apparatus embodiments of the present disclosure are described below, which can be used to execute the method embodiments of the present disclosure.



FIG. 5 is a structural block diagram of a data processing apparatus according to an implementation manner of the present disclosure. As shown in FIG. 5, the apparatus can be implemented through software, hardware, or a combination thereof to become a part or all of an electronic device.


As shown in FIG. 5, the apparatus 500 includes one or more processor(s) 502 or data processing unit(s) and memory 504. The apparatus 500 may further include one or more input/output interface(s) 506 and one or more network interface(s) 508. The memory 504 is an example of computer-readable media.


Computer-readable media further include non-volatile and volatile, removable and non-removable media employing any method or technique to achieve information storage. The information may be computer-readable instructions, data structures, modules of programs, or other data. Examples of computer storage media include, but are not limited to, a phase-change random access memory (PRAM), a static random access memory (SRAM), a dynamic random access memory (DRAM), other types of random access memories (RAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a flash memory or other memory technologies, a compact disc read-only memory (CD-ROM), a digital versatile disc (DVD) or other optical memories, a magnetic cassette tape, a magnetic tape, a magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which may be used to store information that can be accessed by a computing device. As defined herein, the computer-readable media do not include transitory media, such as modulated data signals and carriers.


The memory 504 may store therein a plurality of modules or units including:


a first determination module 510, configured to determine a master data range on a current node, wherein master data in the master data range corresponds to multiple copy data stored on other nodes;


a segmentation module 512, configured to segment the master data range into multiple first sub-data ranges; and


a repair module 514, configured to perform data repair on each of the first sub-data ranges respectively, so as to repair inconsistent data between sub-data in the first sub-data ranges and corresponding copy sub-data in the copy data to make them consistent.


In this embodiment, a data cluster comprises multiple nodes, each of which stores data in a distributed system. The same block of data in the distributed system may include multiple copies respectively stored on multiple nodes, for example, on 3 nodes, where one of the nodes stores the master data of the block of data and the other nodes store copy data of the block of data. In order to ensure that the data in the distributed system are not lost, it is necessary to keep consistency between the master data and the copy data. To that end, in this embodiment of the present disclosure, each of the nodes automatically repairs the master data it is responsible for (if the node also stores copy data, those copy data are repaired by the nodes storing the corresponding master data rather than by the node itself). It should be noted that multiple different blocks of data can be stored on the same node, and the multiple different blocks of data can be master data or copy data; that is to say, both master data and copy data can be stored on the same node.


Therefore, in the data repair process according to this embodiment of the present disclosure, each of the nodes repairs the master data stored on the node, and the copy data stored on the node are repaired by the node storing the master data corresponding to the copy data. In this way, it can avoid the problem that multiple nodes repeatedly repair the same block of data, thereby saving system resources.


In this embodiment, after a round of repair starts, each of the nodes determines the master data range stored on the current node, that is, the range of the master data stored on the current node, for example, the range from the first record to the last record of the master data. Of course, it can be understood that when the master data of multiple blocks of data are stored on the current node, there may be multiple master data ranges. It should be noted that the master data in the master data range has multiple copy data stored on other nodes. The purpose of data repair is to repair the master data in the master data range and the multiple copy data stored on other nodes, so as to keep consistency between the master data and the copy data.


After the current node determines the master data range, the master data range can be segmented into multiple first sub-data ranges, and the size of the sub-data in the first sub-data ranges can be predefined, for example, 200 MB by default. Of course, it can be modified to other sizes in advance if necessary, depending on the actual situation, which is not limited herein. For example, in the segmentation process, the segmentation may start from the smallest data record in the master data range, and every N pieces of data are segmented into one first sub-data range. When multiple blocks of master data exist on the current node, the above-described method is used for segmenting each block of master data.
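The segmentation step can be sketched as follows, segmenting by record count for simplicity (the size-based default described above would segment by data size instead). All names are illustrative.

```python
# Illustrative sketch: starting from the smallest data record, segment the
# master data range into first sub-data ranges of at most n records each,
# represented here as inclusive (low, high) key pairs.
def segment_master_range(first_key, last_key, n):
    ranges, lo = [], first_key
    while lo <= last_key:
        hi = min(lo + n - 1, last_key)
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges
```

For example, `segment_master_range(1, 10, 4)` yields `[(1, 4), (5, 8), (9, 10)]`.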


After multiple first sub-data ranges are obtained by segmentation, data repair may be performed on each of the first sub-data ranges. During the repair process of the first sub-data ranges, for example, starting from the first one of the first sub-data ranges, the data in the current first sub-data range in the master data and the data corresponding to the current sub-data range in the copy data stored on other nodes are read and compared. If they are inconsistent, it can be determined whether the data in the current first sub-data range in the master data are incorrect or the data in the current first sub-data range in the copy data are incorrect. For example, if the current node stores the master data and the other two nodes store the copy data respectively, three pieces of data in the current first sub-data range can be read from the current node and the other two nodes, and by a pairwise comparison of consistency, it can be determined which node has wrong data and repair the wrong data. In some embodiments, during the comparison of data consistency, one key record may be used as the granularity for comparison, and this repair method will not cause a situation in which batches of data are mis-repaired when only individual keys are different.


After the data in all the first sub-data ranges on the current node are repaired, the current node can start a next round of polling and repair and repeat the above-described repair process. In this embodiment of the present disclosure, when a round of repair starts, the time required for the round of repair can be calculated through flow control and according to the data size in the master data range on the current node, and the flow control speed can be controlled to ensure that the round of repair is completed in a preset time frame. The preset time frame is related to a storage policy of a distributed file system. For example, when Cassandra deletes a piece of data, it will perform an insert operation. The newly inserted piece of data is called a tombstone. The biggest difference between a tombstone and a normal record is that the tombstone has an expiration time. When the expiration time is reached, the tombstone data will be actually deleted from the disk when Cassandra performs a compaction operation; therefore, in this embodiment of the present disclosure, when Cassandra is used for storing data, the preset time frame can be set to be the tombstone's expiration time (10 days by default).


During the repair process of a distributed system according to the embodiments of the present disclosure, each node automatically polls and repairs data in a master data range stored thereon and, after segmenting the master data range into first sub-data ranges with finer granularity, performs data repair on the first sub-data ranges. In this way, it not only overcomes the defect of resource waste caused by performing repeated repair on data with multiple copies stored on multiple nodes during the data repair process in the conventional techniques but also can realize breakpoint resume by segmenting the master data range into first sub-data ranges with finer granularity and enable data repair on a node to be controlled in a long time frame by controlling the execution progress of a single repair, avoiding an instantaneous increase in resource consumption.


In an example implementation manner of this embodiment, the repair module 514 comprises:


a first generation sub-module, configured to generate first data repair tasks corresponding to each of the first sub-data ranges, wherein the repair tasks are used for repairing the inconsistent data between the sub-data in the first sub-data ranges and the corresponding copy sub-data in the copy data to make them consistent; and


a submission sub-module, configured to assign priorities to the first data repair tasks and then submit them to a task queue, so that the first data repair tasks are executed from the task queue according to the priorities.


In this example implementation manner, a first data repair task may be started for each of the first sub-data ranges, a priority may also be set for each of the first sub-data ranges according to a preset factor, and the first data repair tasks may be invoked and executed according to the set priority, for example, the first data repair task with a higher priority may be invoked and executed first. In some embodiments, the priority assigned to a first data repair task corresponding to a first sub-data range may indicate the urgency of data repair of the first sub-data range, and the first sub-data ranges requiring urgent repair may be assigned a higher priority, while those that do not require urgent repair may be assigned a lower priority. For example, a first sub-data range at the front of the master data range may be assigned a higher priority, while a first sub-data range at the rear may be assigned a lower priority. In this way, more urgent first sub-data ranges can be repaired before the other first sub-data ranges according to the degree of urgency.


In an example implementation manner of this embodiment, the apparatus further comprises:


an assigning module, configured to assign a repair identifier to repaired data in the first sub-data ranges; and


an identification module, configured to identify the first sub-data ranges as being in a repair-completed state after all data in the first sub-data ranges are assigned the repair identifier.


In this example implementation manner, the first data repair tasks may perform comparative repair at a fine granularity on the data in the first sub-data ranges. For example, when one key record is used as the granularity of the comparative repair, the repair can be performed by comparing whether the data record corresponding to the current key in the first sub-data ranges is consistent with the record corresponding to the same key in the copy data. When the data corresponding to the current key are consistent with the corresponding data in the copy data, the data corresponding to the key can be assigned the repair identifier, indicating that the data have been repaired. When the data corresponding to the current key are inconsistent with the corresponding data in the copy data, the inconsistent data between the current key and the copy data can be repaired, and after the repair is completed, the data corresponding to the current key are assigned the repair identifier. In this way, after all the data in a first sub-data range are assigned the repair identifier, it can be determined that the data in the first sub-data range have been repaired, so the first sub-data range can be identified as being in a repair-completed state; otherwise, the first sub-data range is identified as being in a repair-uncompleted state.
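The key-granularity comparative repair described above might be sketched as follows, assuming each sub-data range is modeled as a mapping from keys to records (all names and data shapes here are hypothetical simplifications):

```python
def repair_sub_range(master, replica):
    """Compare records key by key. Inconsistent replica records are
    overwritten with the master's record, and every processed key is
    marked with a repair identifier."""
    repaired = set()  # keys carrying the repair identifier
    for key, record in master.items():
        if replica.get(key) != record:
            replica[key] = record  # repair the inconsistent record
        repaired.add(key)         # assign the repair identifier
    # The sub-range is repair-completed once every key is marked.
    completed = repaired == set(master)
    return repaired, completed
```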


In an example implementation manner of this embodiment, the apparatus further comprises:


a second determination module, configured to determine a second sub-data range extending from a first piece of data that starts to fail in the repair to a first piece of data that starts to succeed in the repair;


a first generation module, configured to generate a second data repair task corresponding to the second sub-data range; and


a submission module, configured to assign a priority to the second data repair task and then submit it to the task queue.


In this example implementation manner, if the node where the copy data are located goes down during the data repair process, the repair will start to fail on each piece of data from the moment of the downtime. Repeated repair, such as three rounds of repair, may then be attempted on each such piece of data; if all rounds fail, the piece of data is recorded as failing in the repair, and the repair continues with subsequent data. After the node where the copy data are located recovers, the subsequent data can be repaired successfully. A second sub-data range can therefore be determined from the first piece of data that starts to fail in the repair to the first piece of data that starts to succeed in the repair; all the data in this second sub-data range failed to be repaired due to the downtime of the node where the copy data are located. Accordingly, a second data repair task can be generated for the data in the second sub-data range, assigned a priority, and submitted to the task queue, so that the second data repair task can be invoked from the task queue to re-repair the data in the second sub-data range. The second sub-data range is a sub-data range included in the first sub-data ranges. After the second data repair task corresponding to the second sub-data range is completed and all the data in the second sub-data range are successfully repaired, the first sub-data range is considered to be in a repair-completed state.
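A sketch of determining the second sub-data range from an ordered sequence of per-record repair outcomes, under the assumption that each outcome is a `(key, succeeded)` pair (the function name and data shapes are illustrative):

```python
def second_sub_range(results):
    """Given ordered (key, succeeded) repair outcomes, return the
    (start_key, end_key) span from the first failing record to the
    first subsequently succeeding record, or None if nothing failed."""
    start = None
    for key, ok in results:
        if not ok and start is None:
            start = key          # first piece of data that fails
        elif ok and start is not None:
            return (start, key)  # first piece of data that succeeds again
    # Failures ran to the end of the range with no later success.
    return None if start is None else (start, None)
```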


In an example implementation manner of this embodiment, the apparatus further comprises:


a second generation module, configured to, after the current node is recovered from a downtime, regenerate the first data repair tasks for the first sub-data ranges in a repair-uncompleted state, and submit the regenerated first data repair tasks to the task queue.


In this example implementation manner, after the current node storing the master data goes down and is recovered, the first sub-data ranges currently in the repair-completed state and those currently in the repair-uncompleted state can be obtained by querying the information recorded by the system. For the first sub-data ranges currently in the repair-uncompleted state, corresponding first data repair tasks can be regenerated, assigned priorities, and submitted to the task queue to continue the repair of the data in those first sub-data ranges. In this implementation manner, the function of breakpoint resume is realized, and the granularity of breakpoint resume is one sub-data range. In some embodiments, the size of one sub-data range may be set to 200 MB, so the granularity of breakpoint resume is relatively fine and the data repair efficiency can be improved.
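Breakpoint resume at sub-data-range granularity could be sketched as follows, assuming the system persists a per-range state that survives the restart (the function name and the state encoding are illustrative):

```python
def resume_after_restart(range_states):
    """range_states maps a sub-range id to 'completed' or 'uncompleted'.
    Regenerate repair tasks only for the uncompleted sub-ranges, so the
    node resumes at sub-range granularity instead of starting over."""
    return [f"repair:{rid}"
            for rid, state in sorted(range_states.items())
            if state != "completed"]
```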


In an example implementation manner of this embodiment, the repair module 514 comprises:


a determination sub-module, configured to determine whether the current first sub-data range belongs to the master data range on the current node; and


a first repair sub-module, configured to, when the current first sub-data range belongs to the master data range on the current node, perform the data repair on the current first sub-data range.


In this example implementation manner, before the data in each of the first sub-data ranges are repaired, it may be determined whether the current first sub-data range still belongs to the master data range on the current node. This is because, after a new node is added to the cluster, the master data range to be repaired by a node in the original cluster may overlap with the master data range to be repaired by the new node. For example, suppose the master data range that node A is originally responsible for is 1-5, and a newly added node B takes over the master data range of 3-5. Since node A will still repair the master data range of 1-5 during this round of repair, while node B will repair the master data range of 3-5 once it is added and its state changes to normal, the data range of 3-5 may be subjected to overlapping repair. This does not affect correctness, but causes repeated data repair. To solve this problem, the embodiment of the present disclosure uses one first sub-data range as the granularity: when starting the data repair of a new first sub-data range in each round, it is only necessary to first determine whether the first sub-data range still belongs to the master data range of the current node. In this way, the problem of resource waste caused by repeated data repair after a new node is added is solved.
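The ownership check can be sketched as a simple containment filter: a sub-range is repaired only if it still lies entirely within the node's current master data range. The numeric-interval representation below is an illustrative assumption (real systems typically use token ranges):

```python
def owned_sub_ranges(sub_ranges, owned):
    """Keep only the sub-ranges still fully inside the node's current
    master range `owned`; sub-ranges taken over by a newly added node
    are skipped to avoid repeated repair."""
    lo, hi = owned
    return [(s, e) for (s, e) in sub_ranges if lo <= s and e <= hi]
```

In the node A / node B example above, after B takes over 3-5, node A's check would retain only the sub-ranges inside 1-3.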


In an example implementation manner of this embodiment, the apparatus further comprises:


a starting module, configured to, after all the first sub-data ranges on the current node are in the repair-completed state, start a next round of the data repair process.


In this example implementation manner, the data repair process of each node may be an infinitely cyclic process: a next round of the data repair process is started after one round of the data repair process ends. The criterion for the end of each round of the data repair process is that all the sub-data ranges in the master data range on the current node are in the repair-completed state. In this way, each node in the cluster can automatically poll and repair the master data in the master data range stored thereon together with the corresponding copy data stored on other nodes, and eventually all data on all nodes in the cluster are continuously repaired so as to always keep consistency between the master data and the copy data in the cluster.


In an example implementation manner of this embodiment, the apparatus further comprises:


a third determination module, configured to, after each round of the data repair process starts, determine a repair period of the current node according to a data size of the master data range and a preset expiration time of deleted data; and


a fourth determination module, configured to determine a data repair speed according to the repair period, so as to complete a round of repair of all data in the master data range according to the data repair speed within the preset expiration time.


In this example implementation manner, the preset expiration time of deleted data is the time that deleted data are retained in a distributed system, which may vary depending on the distributed system adopted; for example, the preset expiration time in the Cassandra system may be gc_grace_seconds. When the Cassandra system deletes a piece of data, it performs an insert operation, and the newly inserted piece of data is called a tombstone. The biggest difference between a tombstone and a normal record is that the tombstone has an expiration time, gc_grace_seconds; when the expiration time is reached, the tombstone data are completely deleted. Therefore, the time that data deleted at the application level are retained before being completely deleted by the system is the preset expiration time of the deleted data. If a round of data repair is completed before the preset expiration time is reached, it will not eventually cause inconsistency between the master data and the copy data.


In this embodiment of the present disclosure, a data repair speed is determined according to the data size in the master data range and the preset expiration time, so that the data in the master data range on the current node are repaired according to the data repair speed to control the time for completing a round of repair process within the preset expiration time. In this way, the embodiment of the present disclosure balances a huge amount of repair tasks (for example, comparison of hash calculations, filling of data gaps between the master data and the copy data, etc.), and performs repair in the allowable time frame of the distributed system, reducing the impact of data repair on customers.
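A back-of-the-envelope sketch of deriving the data repair speed from the master data size and the expiration window. The `safety_factor` headroom is an added assumption, not part of the disclosed method:

```python
def repair_speed(master_bytes, expiration_seconds, safety_factor=0.8):
    """Return the per-second throughput (bytes/s) needed so a full
    round of repair finishes within the deleted-data expiration window
    (e.g. Cassandra's gc_grace_seconds). safety_factor < 1 reserves
    headroom by targeting completion before the window closes."""
    budget = expiration_seconds * safety_factor
    return master_bytes / budget
```

Pacing the repair tasks to this rate spreads the work over the allowable time frame, which is what avoids the instantaneous resource-consumption spike discussed earlier.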


In an example implementation manner of this embodiment, the repair module 514 comprises:


an acquisition sub-module, configured to acquire the sub-data in the first sub-data ranges from the current node, and acquire the copy sub-data corresponding to the sub-data in the first sub-data ranges from other nodes where the copy data are located;


a comparison sub-module, configured to perform a pairwise comparison between the sub-data and the copy sub-data; and


a second repair sub-module, configured to repair the inconsistent data based on a result of the comparison.


In this example implementation manner, when the sub-data in the first sub-data ranges are being repaired, the sub-data in the first sub-data ranges and corresponding copy sub-data on other nodes are read into a storage area, and then a pairwise comparison is performed, for example, when the sub-data 1 have two copy sub-data, namely 2 and 3, the process of the pairwise comparison and repair comprises:


1. comparing the sub-data 1 and the copy sub-data 2 to fill the inconsistent data, for example, if the record corresponding to the key currently read from the sub-data 1 is {1, 2, 3}, and the record corresponding to the key read from the copy sub-data 2 is {1, 2}, then the key {3} can be recorded as the missing data of the copy sub-data 2;


2. comparing the copy sub-data 2 and the copy sub-data 3 to fill the inconsistent data, for example, if the record corresponding to the key currently read from the copy sub-data 2 is {1, 2}, and the record corresponding to the key read from the copy sub-data 3 is {1}, then the key {2} can be recorded as the missing data of the copy sub-data 3;


3. comparing the sub-data 1 and the copy sub-data 3 to fill the inconsistent data, for example, if the record corresponding to the key currently read from the sub-data 1 is {1, 2, 3}, and the record corresponding to the key read from the copy sub-data 3 is {1}, then the key {2, 3} can be recorded as the missing data of the copy sub-data 3; and


finally, it can be determined that the key record in the copy sub-data 2 lacks {3}, and the key record in the copy sub-data 3 lacks {2, 3}, so the key record of {3} can be pushed to the node where the copy sub-data 2 are located to cause the node to repair the key record in the copy sub-data 2 to be {1, 2, 3}, and the key record of {2, 3} can be pushed to the node where the copy sub-data 3 are located to cause the node to repair the key record in the copy sub-data 3 to be {1, 2, 3}.
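The three pairwise comparisons above can be condensed into a union-based sketch: the union of key records across the master sub-data and all copies determines what each copy is missing, and the missing records are then pushed to the corresponding nodes. Modeling each record set as a Python set is an illustrative simplification, not the disclosed data format:

```python
def pairwise_repair(primary, replicas):
    """Compute the full key-record set across the primary sub-data and
    its replicas, record what each replica lacks, and fill the gaps
    in place (standing in for pushing records to the replica nodes)."""
    full = set(primary)
    for rep in replicas:
        full |= set(rep)                     # union of all key records
    missing = [full - set(rep) for rep in replicas]
    for rep, miss in zip(replicas, missing):
        rep |= miss                          # push the missing records
    return missing
```

Running this on the example in the text (sub-data 1 = {1, 2, 3}, copy 2 = {1, 2}, copy 3 = {1}) reproduces the stated result: copy 2 lacks {3}, copy 3 lacks {2, 3}, and both end up as {1, 2, 3}.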



FIG. 6 is a schematic structural diagram of an electronic device suitable for implementing a data processing method according to an implementation manner of the present disclosure.


As shown in FIG. 6, an electronic device 600 comprises a processing unit 602, which may be implemented as a processing unit such as a CPU, a GPU, an FPGA, or an NPU. The processing unit 602 can perform the various processes in the implementation manners of any one of the above-described methods of the present disclosure according to a program stored in a read-only memory (ROM) 604 or a program loaded from a storage portion 616 into a random-access memory (RAM) 606. In the RAM 606, various programs and data necessary for the operation of the electronic device 600 are also stored. The processing unit 602, the ROM 604, and the RAM 606 are connected to each other through a bus 608. An input/output (I/O) interface 610 is also connected to the bus 608.


The following components are connected to the I/O interface 610: an input portion 612 comprising a keyboard, a mouse, etc.; an output portion 614 comprising a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, etc.; a storage portion 616 comprising a hard disk, etc.; and a communication portion 618 comprising a network interface card such as a LAN card, a modem, and the like. The communication portion 618 performs communication processing via a network such as the Internet. A drive 620 is also connected to the I/O interface 610 as needed. A removable medium 622, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is installed on the drive 620 as needed so that a computer program read therefrom can be installed into the storage portion 616 as needed.


In particular, according to an implementation manner of the present disclosure, any one of the methods in the above-referenced implementation manners of the present disclosure may be implemented as a computer software program. For example, an implementation manner of the present disclosure comprises a computer program product having a computer program tangibly embodied on a readable medium thereof, and the computer program comprises program codes for performing any one of the methods in the implementation manners of the present disclosure. In such an implementation manner, the computer program may be downloaded and installed from the network via the communication portion 618 and/or installed from the removable medium 622.


The flowchart and block diagrams in the accompanying drawings illustrate the architectures, functions, and operations of possible implementations of systems, methods, and computer program products in various implementation manners of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, program segment, or code portion that includes one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the accompanying drawings. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block in the block diagrams and/or flowchart and combinations of the blocks in the block diagrams and/or flowchart can be implemented in dedicated hardware-based systems for performing the specified functions or operations, or can be implemented in a combination of dedicated hardware and computer instructions.


The units or modules involved in the implementation manners of the present disclosure can be implemented in software or hardware. The described units or modules may also be provided in a processor, and the names of these units or modules do not constitute a limitation to the units or modules themselves in certain circumstances.


In another aspect, the present disclosure also provides a computer-readable storage medium, which may be a computer-readable storage medium included in the apparatus described in the foregoing implementation manners, or a standalone computer-readable storage medium that is not assembled into a device. The computer-readable storage medium stores one or more programs used by one or more processors to perform the method described in the present disclosure.


The above description merely illustrates the example embodiments of the present disclosure and the technical principles employed. It should be understood by those skilled in the art that the scope of the invention involved in the present disclosure is not limited to the technical solutions formed by the specific combination of the above-described technical features, and should also cover other technical solutions formed by any combination of the above-described technical features or their equivalent features without departing from the inventive concept, for example, a technical solution formed by replacing the above-described features with (but not limited to) the technical features with similar functions disclosed in the present disclosure.


The present disclosure may further be understood with clauses as follows.


Clause 1. A data processing method, comprising:


determining a master data range on a current node, wherein master data in the master data range corresponds to multiple copy data stored on other nodes;


segmenting the master data range into multiple first sub-data ranges; and


performing data repair on each of the first sub-data ranges respectively, so as to repair inconsistent data between sub-data in the first sub-data ranges and corresponding copy sub-data in the copy data to make them consistent.


Clause 2. The method according to clause 1, wherein the performing data repair on each of the first sub-data ranges respectively, so as to repair inconsistent data between sub-data in the first sub-data ranges and corresponding copy sub-data in the copy data to make them consistent comprises:


generating a first data repair task corresponding to each of the first sub-data ranges, wherein the repair task is used for repairing the inconsistent data between the sub-data in the first sub-data ranges and the corresponding copy sub-data in the copy data to make them consistent; and


assigning priorities to the first data repair tasks and then submitting them to a task queue, so that the first data repair tasks are executed from the task queue according to the priorities.


Clause 3. The method according to clause 1 or 2, further comprising:


assigning a repair identifier to repaired data in the first sub-data range; and


identifying the first sub-data range as being in a repair-completed state after all data in the first sub-data range are assigned the repair identifier.


Clause 4. The method according to clause 3, further comprising:


determining a second sub-data range based on a first piece of data that starts to fail in the repair to a first piece of data that starts to succeed in the repair;


generating a second data repair task corresponding to the second sub-data range; and


assigning a priority to the second data repair task and then submitting it to the task queue.


Clause 5. The method according to any one of clauses 1-2 and 4, further comprising:


after the current node is recovered from a downtime, regenerating the first data repair tasks for the first sub-data ranges in a repair-uncompleted state, and submitting the regenerated first data repair tasks to the task queue.


Clause 6. The method according to any one of clauses 1-2 and 4, wherein the performing data repair on each of the first sub-data ranges respectively, so as to repair inconsistent data between sub-data in the first sub-data ranges and corresponding copy sub-data in the copy data to make them consistent comprises:


determining whether the current first sub-data range belongs to the master data range on the current node; and


when the current first sub-data range belongs to the master data range on the current node, performing the data repair on the current first sub-data range.


Clause 7. The method according to any one of clauses 1-2 and 4, further comprising:


after all the first sub-data ranges on the current node are in the repair-completed state, starting a next round of the data repair process.


Clause 8. The method according to any one of clauses 1-2 and 4, further comprising:


after each round of the data repair process starts, determining a repair period of the current node according to a data size of the master data range and a preset expiration time of deleted data; and


determining a data repair speed according to the repair period, so as to complete a round of repair of all data in the master data range according to the data repair speed within the preset expiration time.


Clause 9. The method according to any one of clauses 1-2 and 4, wherein the performing data repair on each of the first sub-data ranges respectively, so as to repair inconsistent data between sub-data in the first sub-data ranges and corresponding copy sub-data in the copy data to make them consistent comprises:


acquiring the sub-data in the first sub-data ranges from the current node, and acquiring the copy sub-data corresponding to the sub-data in the first sub-data ranges from other nodes where the copy data are located;


performing a pairwise comparison between the sub-data and the copy sub-data; and


repairing the inconsistent data based on a result of the comparison.


Clause 10. A data storage system, comprising: multiple nodes which comprise a storage device and a processing device, wherein


the storage device is configured to store master data and/or copy data, and the master data and the copy data corresponding to the same data are stored on the storage devices of different nodes; and


the processing device is configured to repair data on the storage device and during the data repair process, the processing device segments a master data range where the master data on the storage device is located into multiple first sub-data ranges, and performs the data repair on each of the first sub-data ranges respectively, so as to repair inconsistent data between sub-data in the first sub-data ranges and corresponding copy sub-data in the copy data to make them consistent.


Clause 11. The system according to clause 10, wherein when performing the data repair on each of the first sub-data ranges, the processing device generates first data repair tasks corresponding to each of the first sub-data ranges;


the processing device also assigns priorities to the first data repair tasks and then submits them to a task queue; and


the processing device further executes the first data repair tasks from the task queue according to the priority, so that the repair tasks repair the inconsistent data between the sub-data in the first sub-data ranges and the corresponding copy sub-data in the copy data to make them consistent.


Clause 12. The system according to clause 10 or 11, wherein the processing device assigns a repair identifier to repaired data in the first sub-data ranges and identifies the first sub-data ranges as being in a repair-completed state after all data in the first sub-data ranges are assigned the repair identifier.


Clause 13. The system according to clause 12, wherein during the data repair process, the processing device further determines a second sub-data range based on a first piece of data that starts to fail in the repair to a first piece of data that starts to succeed in the repair, generates a second data repair task corresponding to the second sub-data range, and assigns a priority to the second data repair task and then submits it to the task queue.


Clause 14. The system according to any one of clauses 10-11 and 13, wherein after the node where the processing device is located is recovered from a downtime, the processing device regenerates the first data repair tasks for the first sub-data ranges in a repair-uncompleted state and submits the regenerated first data repair tasks to the task queue.


Clause 15. The system according to any one of clauses 10-11 and 13, wherein when starting to perform the data repair on the first sub-data ranges, the processing device determines whether the current first sub-data range belongs to the master data range on a current node and, when the current first sub-data range belongs to the master data range on the current node, performs the data repair on the current first sub-data range.


Clause 16. The system according to any one of clauses 10-11 and 13, wherein after all the first sub-data ranges on the storage device are in the repair-completed state, the processing device starts a next round of the data repair process.


Clause 17. The system according to any one of clauses 10-11 and 13, wherein after each round of the data repair process starts, the processing device determines a repair period according to a data size of the master data range and a preset expiration time of deleted data, and also determines a data repair speed according to the repair period; and the processing device completes a round of repair of all data in the master data range according to the data repair speed within the preset expiration time.


Clause 18. The system according to any one of clauses 10-11 and 13, wherein the processing device acquires the sub-data in the first sub-data ranges from the storage device and acquires the copy sub-data corresponding to the sub-data in the first sub-data ranges from nodes where the copy data are located; and the processing device performs a pairwise comparison between the sub-data and the copy sub-data and repairs the inconsistent data based on a result of the comparison.


Clause 19. A data processing apparatus, comprising:


a first determination module, configured to determine a master data range on a current node, wherein master data in the master data range corresponds to multiple copy data stored on other nodes;


a segmentation module, configured to segment the master data range into multiple first sub-data ranges; and


a repair module, configured to perform data repair on each of the first sub-data ranges respectively, so as to repair inconsistent data between sub-data in the first sub-data ranges and corresponding copy sub-data in the copy data to make them consistent.


Clause 20. An electronic device, comprising: a memory and a processor, wherein


the memory is configured to store one or more computer instructions, and the one or more computer instructions are executed by the processor to implement the method according to any one of clauses 1-9.


Clause 21. A computer-readable storage medium having computer instructions stored thereon, wherein the computer instructions, when executed by a processor, implement the method according to any one of clauses 1-9.

Claims
  • 1. A method comprising: determining a master data range on a current node, wherein master data in the master data range corresponds to multiple copy data stored on other nodes;segmenting the master data range into multiple first sub-data ranges; andperforming a data repair on a respective first sub-data range of the multiple first sub-data ranges to repair inconsistent data between sub-data in the respective first sub-data range and corresponding copy sub-data in the multiple copy data to make them consistent.
  • 2. The method according to claim 1, wherein the performing the data repair on the respective first sub-data range of the multiple first sub-data ranges to repair inconsistent data between the sub-data in the respective first sub-data range and corresponding copy sub-data in the multiple copy data to make them consistent comprises: generating a first data repair task corresponding to the respective first sub-data range, wherein the first repair task is used for repairing the inconsistent data between the sub-data in the first sub-data ranges and the corresponding copy sub-data in the copy data to make them consistent;assigning priorities to first data repair tasks respectively; andsubmitting the first data repair tasks to a task queue, so that the first data repair tasks are executed from the task queue according to the priorities.
  • 3. The method according to claim 2, further comprising: after the current node is recovered from a downtime, regenerating the first data repair task for the respective first sub-data range that is in a repair-uncompleted state; andsubmitting the regenerated first data repair tasks to the task queue.
  • 4. The method according to claim 1, further comprising: assigning a repair identifier to repaired data in the respective first sub-data range; andidentifying the respective first sub-data range as being in a repair-completed state after all data in the respective first sub-data range are assigned the repair identifier.
  • 5. The method according to claim 3, further comprising: determining a second sub-data range based on a first piece of data that starts to fail in the data repair to a first piece of data that starts to succeed in the data repair;generating a second data repair task corresponding to the second sub-data range;assigning a priority to the second data repair task; andsubmitting the second data repair task to the task queue.
  • 6. The method according to claim 1, wherein the performing the data repair on the respective first sub-data range of the multiple first sub-data ranges to repair inconsistent data between the sub-data in the respective first sub-data range and corresponding copy sub-data in the multiple copy data to make them consistent comprises: determining whether a current first sub-data range belongs to the master data range on the current node; andin response to determining that the current first sub-data range belongs to the master data range on the current node, performing the data repair on the current first sub-data range.
  • 7. The method according to claim 1, further comprising: after all of the first sub-data ranges on the current node are in a repair-completed state, starting a next round of the data repair.
  • 8. The method according to claim 1, further comprising: after each round of the data repair starts, determining a repair period of the current node according to a data size of the master data range and a preset expiration time of deleted data; anddetermining a data repair speed according to the repair period to complete a round of repair of all data in the master data range according to the data repair speed within the preset expiration time.
  • 9. The method according to claim 1, wherein the performing the data repair on the respective first sub-data range of the multiple first sub-data ranges to repair inconsistent data between the sub-data in the respective first sub-data range and corresponding copy sub-data in the multiple copy data to make them consistent comprises: acquiring sub-data in the respective first sub-data range from the current node;acquiring copy sub-data corresponding to the sub-data in the respective first sub-data range from other nodes where the multiple copy data are located;performing a comparison between the sub-data and the copy sub-data; andrepairing the inconsistent data based on a result of the comparison.
  • 10. A data storage system, the system comprising: multiple nodes, a respective node of the multiple nodes including a storage device and a processing device, wherein: the storage device stores master data and/or copy data, and the master data and the copy data corresponding to same data are stored on storage devices of different nodes in the multiple nodes; and the processing device repairs data on the storage device and during a data repair process, segments a master data range of the master data located on the storage device into multiple first sub-data ranges, and performs a data repair on a respective first sub-data range of the multiple first sub-data ranges to repair inconsistent data between sub-data in the respective first sub-data range and corresponding copy sub-data in the multiple copy data to make them consistent.
  • 11. The system according to claim 10, wherein: the processing device assigns priorities to first data repair tasks corresponding to the first sub-data ranges and then submits the first data repair tasks to a task queue; and the processing device further executes the first data repair tasks from the task queue according to the priorities to repair the inconsistent data between the sub-data in the first sub-data ranges and the corresponding copy sub-data in the copy data to make them consistent.
  • 12. The system according to claim 11, wherein the processing device assigns a repair identifier to repaired data in the respective first sub-data range, and identifies the respective first sub-data range as being in a repair-completed state after all data in the respective first sub-data range are assigned the repair identifier.
  • 13. The system according to claim 12, wherein during the data repair process, the processing device further: determines a second sub-data range extending from a first piece of data that starts to fail in the data repair to a first piece of data that starts to succeed in the data repair; generates a second data repair task corresponding to the second sub-data range; assigns a priority to the second data repair task; and submits the second data repair task to the task queue.
  • 14. The system according to claim 10, wherein after a current node is recovered from a downtime, the processing device regenerates a respective first data repair task for the respective first sub-data range that is in a repair-uncompleted state, and submits the regenerated first data repair tasks to a task queue.
  • 15. The system according to claim 10, wherein when starting to perform the data repair on the first sub-data ranges, the processing device determines whether a current first sub-data range belongs to the master data range on a current node, and in response to determining that the current first sub-data range belongs to the master data range on the current node, performs the data repair on the current first sub-data range.
  • 16. The system according to claim 10, wherein after all the first sub-data ranges on the storage device are in a repair-completed state, the processing device starts a next round of the data repair.
  • 17. The system according to claim 10, wherein after each round of the data repair process starts, the processing device determines a repair period of a current node according to a data size of the master data range and a preset expiration time of deleted data, and determines a data repair speed according to the repair period to complete a round of repair of all data in the master data range according to the data repair speed within the preset expiration time.
  • 18. The system according to claim 10, wherein the processing device further: acquires sub-data in the respective first sub-data range from a current node;acquires copy sub-data corresponding to the sub-data in the respective first sub-data range from other nodes where the multiple copy data are located;performs a comparison between the sub-data and the copy sub-data; andrepairs the inconsistent data based on a result of the comparison.
  • 19. The system according to claim 18, wherein the comparison is a pairwise comparison.
  • 20. One or more memories storing thereon computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform acts comprising: determining a master data range on a current node, wherein master data in the master data range corresponds to multiple copy data stored on other nodes; segmenting the master data range into multiple first sub-data ranges; and performing a data repair on a respective first sub-data range of the multiple first sub-data ranges to repair inconsistent data between sub-data in the respective first sub-data range and corresponding copy sub-data in the multiple copy data to make them consistent.
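As an illustrative, non-limiting sketch (not part of the claims), the segmentation of a master data range into sub-ranges and the prioritized task queue recited in claims 1, 2, and 11 could be realized as follows. The numeric token range, the priority values, and the helper names are assumptions for illustration only.

```python
import heapq

def segment_range(start, end, num_segments):
    """Split a master data (token) range into contiguous sub-ranges."""
    step = (end - start) // num_segments
    bounds = [start + i * step for i in range(num_segments)] + [end]
    return list(zip(bounds[:-1], bounds[1:]))

class RepairQueue:
    """Priority queue of repair tasks; a lower number means higher priority."""
    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker keeps FIFO order within one priority

    def submit(self, priority, sub_range):
        heapq.heappush(self._heap, (priority, self._counter, sub_range))
        self._counter += 1

    def next_task(self):
        """Pop the highest-priority (then oldest) pending sub-range."""
        return heapq.heappop(self._heap)[2]

# Example: segment the range [0, 100) into 4 sub-ranges and enqueue
# one first data repair task per sub-range at the same priority.
queue = RepairQueue()
for sub in segment_range(0, 100, 4):
    queue.submit(priority=1, sub_range=sub)
```

A failed sub-range (a "second data repair task" in claims 5 and 13) could then be re-submitted to the same queue with a different priority value.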
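The repair-speed determination of claims 8 and 17 (finish one full round before deleted data expires, e.g. before tombstones are garbage-collected) can be sketched as a simple rate calculation. The `safety_factor` headroom parameter is an assumption, not part of the claims.

```python
def repair_speed(master_data_bytes, expiration_seconds, safety_factor=0.8):
    """
    Return the bytes-per-second rate needed to repair the whole master
    data range within the preset expiration time of deleted data.
    safety_factor < 1 shortens the effective repair period to leave
    headroom; it is an assumed tuning knob, not claimed behavior.
    """
    repair_period = expiration_seconds * safety_factor
    return master_data_bytes / repair_period

# Example: 100 GiB of master data with a 10-day expiration window.
speed = repair_speed(100 * 2**30, 10 * 86400)
```

Pacing the repair this way bounds the resource overhead of each round instead of repairing as fast as possible.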
Priority Claims (1)
  Number: 202010664421.X
  Date: Jul 2020
  Country: CN
  Kind: national
CROSS REFERENCE TO RELATED APPLICATIONS

This application claims priority to and is a continuation of PCT Patent Application No. PCT/CN2021/105201, filed on 8 Jul. 2021 and entitled “DATA PROCESSING METHOD AND APPARATUS, AND ELECTRONIC DEVICE, AND STORAGE MEDIUM,” which claims priority to Chinese Patent Application No. 202010664421.X, filed on 10 Jul. 2020 and entitled “DATA PROCESSING METHOD AND APPARATUS, ELECTRONIC DEVICE, AND STORAGE MEDIUM,” which are incorporated herein by reference in their entirety.

Continuations (1)
  Parent: PCT/CN2021/105201, filed Jul 2021 (US)
  Child: 18095522 (US)