The present invention relates to a storage system for Internet of Things (IOT) backup in data centers and an associated method. More particularly, the present invention relates to a storage system for IOT backup in data centers that uses distributed deduplication technology to off-load deduplication processing from the storage system to the edge components connected thereto, and to scatter the large deduplication table of a centralized storage system across all the storage units.
Data centers are where huge amounts of digital data are stored for access. Over time, the same data may be packaged in different formats, e.g. a statistical chart embedded in both an Excel file and a Word file. Each copy occupies storage space for the same data and thus wastes storage space. In addition, for continuous data input from a single source, repeated data also lowers the performance of the data centers. This is often seen in a continuously updated surveillance video stream that contains a number of consecutive frames in which one or more regions remain still. This is not only another kind of waste of storage space, but also a bottleneck for data transmission in limited-bandwidth network environments.
To settle the above issues, many deduplication methods are available in the prior art. A commonly seen method is to use a deduplication table (DDT) for a storage system in the data center. Conventionally, a DDT works as follows: chunking a file into blocks or variable-sized units; fingerprinting each block or variable-sized unit with a cryptographic hash signature, e.g., SHA-1; and indexing the hash signatures with storage locations to identify and eliminate duplicates. The DDT is usually kept in a RAM module of the storage system. The rule of thumb for DDT size calculation in the Z File System (ZFS) is that every 1 TB of data in the storage space needs around 5 GB of RAM for the DDT. Other file systems share a similar figure. For a ZB-level data center, the size of the DDT would grow to 5 EB (1 ZB is 10^9 TB, and 10^9 x 5 GB = 5 EB), which would be an unaffordable cost.
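The conventional DDT workflow and the sizing rule of thumb described above can be sketched as follows. This is only a minimal illustration, not the implementation of any particular file system: the function names `build_ddt` and `ddt_ram_gb`, the fixed 4096-byte block size, and the use of decimal units (1 ZB = 10^9 TB, 1 EB = 10^9 GB) are assumptions made for the example.

```python
import hashlib

def build_ddt(data: bytes, block_size: int = 4096):
    """Chunk, fingerprint and index `data`; return (ddt, stored, layout)."""
    ddt = {}      # SHA-1 hex digest -> position of the unique block in `stored`
    stored = []   # unique blocks that are actually kept
    layout = []   # for each input block, the position of its content in `stored`
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        fingerprint = hashlib.sha1(block).hexdigest()
        if fingerprint not in ddt:          # first time this content is seen
            ddt[fingerprint] = len(stored)
            stored.append(block)
        layout.append(ddt[fingerprint])     # duplicates only add a reference
    return ddt, stored, layout

def ddt_ram_gb(stored_tb: int, gb_per_tb: int = 5) -> int:
    """Rule of thumb quoted above: about 5 GB of DDT per 1 TB of stored data."""
    return stored_tb * gb_per_tb

TB_PER_ZB = 10**9   # assumption: decimal units
GB_PER_EB = 10**9   # assumption: decimal units
ddt_eb_for_one_zb = ddt_ram_gb(TB_PER_ZB) // GB_PER_EB
```

Under these assumptions, three blocks of which two are identical produce a DDT with two entries, and `ddt_eb_for_one_zb` reproduces the 5 EB figure for a ZB-level data center.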
In view of the above, a method for effectively reducing the burden of the DDT in data centers is desired. A system utilizing the method, which can reduce storage space by eliminating duplicate data while minimizing the transmission of redundant data in limited-bandwidth network environments, is highly desirable, especially as the requirements of IOT increase.
This paragraph extracts and compiles some features of the present invention; other features will be disclosed in the follow-up paragraphs. It is intended to cover various modifications and similar arrangements included within the spirit and scope of the appended claims.
In order to settle the issues above, a method for achieving distributed deduplication for a storage system for IOT backup in a data center is provided. The method includes the steps of: a) providing a deterministic function to control units each for one storage unit in a storage system and an edge component linked to the storage system; b) dividing a To-Be-Backup Data (TBBD) in the edge component into a plurality of To-Be-Stored Chunks (TBSC) in premeditated size by the edge component; c) calculating a hash value for each TBSC by the deterministic function by the edge component; d) calculating a To-Be-Stored Destination (TBSD) for each TBSC by the deterministic function by the edge component; e) checking if one TBSC already exists at a corresponding TBSD by a control unit in the storage unit chosen by the deterministic function; f) transmitting the TBSC(s) to the corresponding TBSD(s) where no TBSC exists and the associated hash value(s) to the control unit(s); g) storing the TBSC(s) in the corresponding TBSD(s) and the hash value(s) in a storage unit(s) chosen by the deterministic function; and h) indexing the stored TBSC(s) with the corresponding hash value(s) and TBSD(s) to the edge component and the control unit(s) in the storage unit(s).
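Steps (b) to (h) above can be sketched as follows. This is a simplified illustration under stated assumptions, not the claimed implementation: the deterministic function is modeled as SHA-256 plus a modulo over the number of storage units, a TBSD is modeled as the pair (storage unit, hash key), and the names `ControlUnit`, `chunk_hash`, `destination` and `backup` are hypothetical.

```python
import hashlib

NUM_UNITS = 4  # assumption: four storage units, one control unit each

def chunk_hash(chunk: bytes) -> str:
    """Step (c): hash value of a TBSC (SHA-256 is an assumption)."""
    return hashlib.sha256(chunk).hexdigest()

def destination(hash_hex: str, num_units: int = NUM_UNITS) -> int:
    """Step (d): storage unit holding the TBSD, derived only from the hash."""
    return int(hash_hex, 16) % num_units

class ControlUnit:
    """Toy control unit: keeps the chunks of one storage unit, keyed by hash."""
    def __init__(self):
        self.chunks = {}

    def exists(self, hash_hex: str) -> bool:        # step (e)
        return hash_hex in self.chunks

    def store(self, hash_hex: str, chunk: bytes):   # step (g)
        self.chunks[hash_hex] = chunk

def backup(tbbd: bytes, chunk_size: int, units):
    """Run steps (b)-(h) on the edge side; return (index, chunks_transmitted)."""
    index, transmitted = [], 0
    for offset in range(0, len(tbbd), chunk_size):   # step (b): divide the TBBD
        chunk = tbbd[offset:offset + chunk_size]
        h = chunk_hash(chunk)
        unit_id = destination(h, len(units))
        if not units[unit_id].exists(h):             # only new data is sent
            units[unit_id].store(h, chunk)           # steps (f) and (g)
            transmitted += 1
        index.append((h, unit_id))                   # step (h): index entry
    return index, transmitted
```

Because both the edge component and the control units evaluate the same function, a duplicate chunk is detected with a single lookup at one storage unit, and only chunks that are not yet stored travel over the network.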
Preferably, the deterministic function may be driven by variables of hash values, resilience schemes, distribution rules for storage units, Quality of Service (QoS) policy or Service Level Agreement (SLA) policy. The method may further include after step (h) the steps of: i) checking if all stored TBSC(s) are kept in the corresponding TBSD(s) periodically by the control units in the corresponding storage units; and j) if the result of step (i) is no, restoring the lost stored TBSC(s). The method may also include between step (b) and step (c) a step of: b1) encoding the TBSCs to have a plurality of To-Be-Stored Parities (TBSP). The method may even further include between step (b) and step (e) the steps of: c1) calculating a hash value for each TBSP by the deterministic function by the edge component; and d1) calculating a TBSD for each TBSC and each TBSP by the deterministic function by the edge component.
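The encoding of step (b1) can be illustrated with the simplest possible parity scheme. The text above does not specify which code produces the TBSPs, so the single XOR parity below is an assumption chosen for brevity; a practical system would more likely use an erasure code that yields several parities. The helper names `xor_parity` and `recover` are hypothetical.

```python
def xor_parity(chunks):
    """Return one parity chunk: the bytewise XOR of all chunks.

    Shorter chunks are treated as if padded with zero bytes.
    """
    parity = bytearray(max(len(c) for c in chunks))
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            parity[i] ^= byte
    return bytes(parity)

def recover(survivors, parity, length):
    """Rebuild the single missing chunk from the parity and the other chunks."""
    missing = bytearray(parity)
    for chunk in survivors:
        for i, byte in enumerate(chunk):
            missing[i] ^= byte
    return bytes(missing[:length])   # drop the zero padding
```

In this sketch the parity chunk would be hashed and given a TBSD in the same way as a data chunk, as in steps (c1) and (d1); any one lost chunk of the group can then be rebuilt from the parity and the surviving chunks.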
The present invention also provides another method for achieving distributed deduplication for a storage system for IOT backup in a data center. The method includes the steps of: a) providing a deterministic function to control units each for one storage unit in a storage system and an edge component linked to the storage system; b) dividing a TBBD in the edge component into a plurality of TBSCs in premeditated size by the edge component; c) calculating a hash value for each TBSC by the deterministic function by the edge component; d) calculating a TBSD for each TBSC of N replicas of the TBBD by the deterministic function by the edge component; e) checking if the TBSCs of the first replica already exist at corresponding TBSDs by the control units; f) transmitting the TBSC(s) having no TBSC existing at its TBSD with associated TBSDs of the same TBSC(s) in other replica(s) to the corresponding TBSD(s) and the associated hash value(s) to the control unit(s); g) storing the TBSC(s) in the corresponding TBSD(s) and the hash value(s) in a storage unit(s) chosen by the deterministic function; h) replicating the TBSC(s) transmitted to the TBSD(s) of the same TBSC(s) in other replica(s); and i) indexing the stored TBSC(s) with the corresponding hash value(s) and TBSD(s) to the edge component and the control unit(s) in the storage unit(s).
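The replica-aware variant can be sketched in the same style. The placement rule below (the first storage unit chosen by the hash, the remaining replicas on the following units) is an assumption for illustration only, as are the names `replica_destinations` and `backup_with_replicas`; storage units are modeled as plain dictionaries keyed by hash.

```python
import hashlib

def replica_destinations(hash_hex: str, num_units: int, n_replicas: int):
    """Step (d): one storage unit per replica, all derived from the hash."""
    if n_replicas > num_units:
        raise ValueError("need at least one storage unit per replica")
    first = int(hash_hex, 16) % num_units
    return [(first + r) % num_units for r in range(n_replicas)]

def backup_with_replicas(chunks, units, n_replicas):
    """Run steps (c)-(i); return (index, chunks_transmitted_by_edge)."""
    index, transmitted = [], 0
    for chunk in chunks:
        h = hashlib.sha256(chunk).hexdigest()            # step (c)
        dests = replica_destinations(h, len(units), n_replicas)
        if h not in units[dests[0]]:                     # step (e): first replica only
            units[dests[0]][h] = chunk                   # steps (f), (g): sent once
            transmitted += 1
            for d in dests[1:]:                          # step (h): storage-side copy
                units[d][h] = units[dests[0]][h]
        index.append((h, dests))                         # step (i)
    return index, transmitted
```

The point of the sketch is that the edge component transmits each new chunk only once, together with the destinations of the other replicas; the additional copies are made inside the storage system at locations that were already fixed by the deterministic function.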
Preferably, the deterministic function may be driven by variables of hash values, resilience schemes, distribution rules for storage units, QoS policy or SLA policy. The method may further include after step (i) the steps of: j) checking if all stored TBSC(s) are kept in the corresponding TBSD(s) periodically by the control units; and k) if the result of step (j) is no, making a new replica for the lost stored TBSC(s). The method may also include between step (b) and step (c) a step of: b1) encoding the TBSCs to have a plurality of TBSPs. The method may even further include between step (b) and step (e) the steps of: c1) calculating a hash value for each TBSP by the deterministic function by the edge component; and d1) calculating a TBSD for each TBSP by the deterministic function by the edge component.
According to the present invention, a storage system of distributed deduplication achieved by the method above for IOT backup in a data center is disclosed. The storage system may include: a number of storage units, each having a number of TBSDs; a control unit, for controlling operations of the storage unit; and a distributed deduplication module, for providing or updating the deterministic function to the control unit and the edge component, and executing each step of the method in the control unit and/or the edge component. Preferably, the distributed deduplication module may be hardware or software installed in the control unit.
The present invention will now be described more specifically with reference to the following embodiments.
Please refer to
The edge components are all devices or equipment linked to the storage system 10 over a network 300 and embedded with electronics, software, sensors, actuators, and network connectivity that enable them to collect and exchange data. The collected data need to be backed up in the data center (storage system 10) for further use or analysis. The edge components may be a personal computer 410 that uploads homemade videos to share with others, a smart phone 420 that uses a social communication app to exchange messages with the help of the storage system 10, an embedded sensor 430 in a smart shirt that keeps recording body temperature and stores the data in the storage system 10 for analysis, a monitor 440 that watches crowds at the gate of a store and backs up the monitored video in the storage system 10, and a remote tracking device 450 installed in a rental car to trace the car. Each edge component represents an application scenario of the present invention. Whichever application takes place, deduplication of data sent to the storage system 10 is necessary; otherwise the storage system 10 will soon be filled with redundant data. The present invention provides a new means, distributed deduplication, in which deduplication is no longer implemented by the storage system 10 (control units) alone. Instead, the whole process is carried out jointly by the storage system 10 and the edge components linked thereto, so the load on the storage system 10 can be reduced. The methods for achieving distributed deduplication for the storage system for IOT backup in a data center are disclosed below with detailed descriptions of embodiments.
Assume a user uses the personal computer 410 to upload a video to the storage system 10, where a video-sharing workload runs to share the video with those who are interested. The video contains some fragments that come from movie clips, and the movie clips may already have been backed up in a storage unit of the storage system 10. In order to deduplicate these fragments and save storage space, the method provided by the present invention can be applied. Please see
The second step of the method is dividing a TBBD in the personal computer 410 into a number of TBSCs in premeditated size by the personal computer 410 (S02). The TBBD is the video file in this case. Take the premeditated size to be 512 Kbits, the size of a block in a storage unit. Suppose the video file is 4000 Kbits in size. Since 4000 / 512 is about 7.8, there are 8 TBSCs, the last one holding the remaining 416 Kbits (C1 to C8 shown in the first row of the table in
The following step is to calculate a TBSD for each TBSC by the deterministic function by the personal computer 410 (S04). Please see
Next, store the TBSC(s) in the corresponding TBSD(s) and the hash value(s) in the storage unit(s) chosen by the deterministic function (S08). In step S08, the locations of the hash values are not assigned by any specific rule; the deterministic function determines suitable locations. As illustrated above, a storage unit includes many TBSDs. A TBSD is the minimum storage element reserved for a TBSC, while the storage unit simply keeps the hash value(s), no matter which of its TBSDs are assigned to do the job.
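The effect of letting the deterministic function choose where each hash value is kept can be illustrated as follows. The modulo rule and the names `ddt_shard` and `shard_sizes` are assumptions made for this sketch; the description does not prescribe a specific rule.

```python
import hashlib
from collections import Counter

def ddt_shard(hash_hex: str, num_units: int) -> int:
    """Storage unit that keeps the table entry for this hash (assumed rule)."""
    return int(hash_hex, 16) % num_units

def shard_sizes(hashes, num_units: int) -> Counter:
    """Number of hash entries each storage unit would have to keep."""
    return Counter(ddt_shard(h, num_units) for h in hashes)

# Synthetic hash values standing in for the hashes of 1000 stored TBSCs.
hashes = [hashlib.sha256(str(i).encode()).hexdigest() for i in range(1000)]
sizes = shard_sizes(hashes, 4)
```

With such a rule no single component holds the whole deduplication table: each storage unit keeps only the entries that map to it, and any party that knows the function can find the unit responsible for a given hash without a central lookup.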
The following step is indexing the stored TBSC(s) with the corresponding hash value(s) and TBSD(s) to the personal computer 410 and the control unit(s) in the storage unit(s) (S09). That is, once a new TBSC is stored in the corresponding TBSD, the corresponding hash value and TBSD should be made known to all parties. The indexes may be kept in the control units or in some TBSDs in the storage units of the storage system 10, and in a sandbox in a memory or a storage of the personal computer 410. From
The final step is to check periodically, by the control units, whether all stored TBSC(s) are still kept in the corresponding TBSD(s) (S10). A stored TBSC may be lost for some reason, e.g. the content of a TBSD having been carelessly deleted. A lost TBSC needs to be restored to keep the system synchronized and consistent. So, if any stored TBSC is found lost, the lost stored TBSC(s) is restored (S11). This can be done by reverse derivation with the indexed hash value. If no stored TBSC is found lost, all TBSC(s) remain in the corresponding TBSD(s) (S12). Step S10 is repeated again and again to ensure that no stored TBSC backed up in the storage system 10 is lost.
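One pass of the periodic check of steps S10 to S12 can be sketched as below. How a lost TBSC is actually recovered is not detailed here, so the sketch assumes a hypothetical callback `fetch_chunk` that returns the chunk for a given hash from some other holder (for example the edge component), or `None` if it is unavailable. The function name `scrub`, the use of SHA-256 and the string form of the TBSD keys are likewise assumptions.

```python
import hashlib

def scrub(index, store, fetch_chunk):
    """One pass of the periodic check; return the list of restored TBSDs.

    index: {tbsd: expected hash}, kept by the control unit (step S09)
    store: {tbsd: chunk bytes}, the current content of the storage unit
    fetch_chunk: hypothetical callback, hash -> bytes or None
    """
    restored = []
    for tbsd, expected in index.items():
        chunk = store.get(tbsd)
        if chunk is not None and hashlib.sha256(chunk).hexdigest() == expected:
            continue                          # S12: chunk is still in place
        candidate = fetch_chunk(expected)     # S11: try to restore the chunk
        if candidate is not None and hashlib.sha256(candidate).hexdigest() == expected:
            store[tbsd] = candidate
            restored.append(tbsd)
    return restored
```

The indexed hash value serves two purposes in this sketch: it tells the control unit which chunk is missing, and it lets the control unit verify that whatever is written back really is the original content.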
The above embodiment shows the method for a general TBBD. According to the spirit of the present invention, there is another method for a general TBBD with parities for error checking. Another embodiment, for this method, is given below.
Please refer to
From
The above two embodiments apply when no replica is required. For safety reasons, some data need replicas. Since the amount of data transmitted and the space needed for storage are then large, the present invention provides other methods to deal with this situation. Two more embodiments are given below to introduce the associated methods.
Assume the embedded sensor 430 keeps sending body temperature and related messages to the storage system 10 for analysis. For a healthy body, the information should remain stable over time. Thus, much of the data may remain unchanged over a period of time. This is a good example for applying the method of the present invention. Please see
The next step is calculating a TBSD for each TBSC of N replicas of the TBBD by the deterministic function by the embedded sensor 430 (S24). N is a positive integer. It means the method can work for any number of replicas. In this embodiment, N is 3. Please refer to
The following step is checking if the TBSCs of the first replica already exist at the corresponding TBSDs by the control units (S25). If the answer is yes, keep the TBSC(s) in the corresponding TBSD(s) (S26); if the answer is no, transmit the TBSC(s) having no TBSC existing at its TBSD, with the associated TBSDs of the same TBSC(s) in other replica(s), to the corresponding TBSD(s), and the associated hash value(s) to the control unit(s) (S27). For a better understanding, please refer back to
The next step is replicating the TBSC(s) transmitted to the TBSD(s) of the same TBSC(s) in other replica(s) (S29). Intuitively, this step makes two extra replicas. However, it is not the same as commonly applied replication: the locations, i.e. the TBSDs, have already been determined by the deterministic function. Next, index the stored TBSC(s) with the corresponding hash value(s) and TBSD(s) to the edge component and the control unit(s) (S30). It should be emphasized that in this embodiment, indexing covers all three sets of TBSCs of the replicas, not only the first replica. Data indexed are shown in
The final step is checking periodically, by the control units, whether all stored TBSC(s) are still kept in the corresponding TBSD(s) (S31). The purpose of step S31 is the same as that of step S10 in the previous embodiments: a lost TBSC needs to be restored. So, if any stored TBSC is found lost, make a new replica for the lost stored TBSC(s) (S32). If no stored TBSC is found lost, all TBSC(s) remain in the corresponding TBSD(s) (S33). Step S31 is repeated again and again to ensure that no stored TBSC of the three replicas in the storage system 10 is lost.
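Steps S31 to S33 can be sketched in a similar way. The sketch assumes that a lost copy is rebuilt from any surviving replica of the same TBSC; the function name `repair_replicas` and the dictionary-based model of the storage units are hypothetical.

```python
def repair_replicas(index, units):
    """One pass of the periodic check; return a list of (hash, unit id) repaired.

    index: {hash: [unit ids holding the replicas]}, as indexed in step S30
    units: list of dicts, one per storage unit, mapping hash -> chunk bytes
    """
    repaired = []
    for h, dests in index.items():
        survivors = [d for d in dests if h in units[d]]
        if not survivors:
            continue                      # every replica is gone; out of scope here
        for d in dests:
            if h not in units[d]:         # S32: make a new replica of the lost copy
                units[d][h] = units[survivors[0]][h]
                repaired.append((h, d))
    return repaired                       # an empty list corresponds to S33
```

Since the destinations of all replicas were fixed by the deterministic function when the chunk was first stored, the repair only has to copy data back to a known location; nothing needs to be recomputed or re-indexed.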
Similarly, the above embodiment shows the method for a general TBBD in several replicas. According to the spirit of the present invention, there is another method for a general TBBD with parities for error checking and one replica for safety reasons. Another embodiment, for this method, is given below.
Please refer to
While the invention has been described in terms of what is presently considered to be the most practical and preferred embodiments, it is to be understood that the invention need not be limited to the disclosed embodiments. On the contrary, it is intended to cover various modifications and similar arrangements included within the spirit and scope of the appended claims, which are to be accorded the broadest interpretation so as to encompass all such modifications and similar structures.