This non-provisional application claims priority under 35 U.S.C. § 119(a) on Patent Application No. 201610643692.0 filed in China, P.R.C. on 2016/08/09, the entire contents of which are hereby incorporated by reference.
The present invention relates to a data storage system, and more specifically, to a data storage system having a function of data de-duplication.
A data de-duplication technology is a data reduction technology, and is usually used in a hard disk; for example, two or more duplicates of a same file may occur in a hard disk. An objective of data de-duplication lies in deleting redundant data in a hard disk, so as to release a storage space for other data to use.
In a known data de-duplication technology, a key value sample database must generally first be established in a memory. When each piece of data is to be written into a hard disk, a central processing unit (CPU) calculates a key value of the data and compares it with the key value samples in the sample database. If the key value is the same as a key value sample in the sample database, the CPU executes a de-duplication program; if the key value is different from every key value sample in the sample database, the CPU adds the key value into the sample database. Therefore, the capacity of the key value sample database is associated with the accuracy of determining whether data is duplicated data. However, increasing the capacity of the sample database so as to keep all sample key values, in order to improve the accuracy, results in excessive occupation of memory space, and the time for searching for key value samples in a large sample database is long. Conversely, if a memory with a small capacity is used, some key values are necessarily discarded once the quantity of key value samples reaches the upper limit of the sample database, which reduces the accuracy. Therefore, how to maintain the accuracy of determining whether data is duplicated data in a limited memory space is one of the important current research and development problems.
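The trade-off described above can be illustrated with a minimal sketch of inline de-duplication against a bounded key value sample database. The class name `InlineDeduplicator`, the use of SHA-256 as the key value, and the oldest-first discard policy are all assumptions made for illustration; the known technology is not stated to use any of them.

```python
import hashlib
from collections import OrderedDict


class InlineDeduplicator:
    """Hypothetical inline de-duplication with a bounded key value sample database."""

    def __init__(self, capacity):
        self.capacity = capacity      # upper limit of the sample database
        self.samples = OrderedDict()  # key value -> physical address of the stored copy
        self.disk = {}                # physical address -> data
        self.next_addr = 0

    def write(self, data: bytes) -> int:
        key = hashlib.sha256(data).hexdigest()  # key value (assumed hash function)
        if key in self.samples:
            return self.samples[key]            # duplicate found: reuse the stored copy
        addr = self.next_addr
        self.disk[addr] = data
        self.next_addr += 1
        if len(self.samples) >= self.capacity:
            self.samples.popitem(last=False)    # database full: discard the oldest sample
        self.samples[key] = addr
        return addr


small = InlineDeduplicator(capacity=2)
small_addrs = [small.write(d) for d in (b"A", b"B", b"C", b"A")]

large = InlineDeduplicator(capacity=4)
large_addrs = [large.write(d) for d in (b"A", b"B", b"C", b"A")]
```

With a capacity of 2, the sample for "A" has already been discarded when "A" is written again, so the duplicate is stored a second time; with a capacity of 4, the duplicate is detected.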
Further, the known data de-duplication technology generally performs data de-duplication immediately after receiving a write request from an input/output (I/O) apparatus, establishes a location relationship table of duplicated data, and stores the location relationship table in a volatile memory. If the volatile memory suffers power supply interruption, the location relationship table is lost and cannot be re-established. To guard against power supply interruption, a non-volatile memory must be used; and when there is a large quantity of duplicated data, a huge location relationship table occupies most of the storable space in the memory. Moreover, each time duplicated data is found, the location relationship table must be updated immediately, which also consumes considerable CPU resources.
In view of this, the present invention provides a data storage system.
In some embodiments, a data storage system includes a memory, a hard disk, and a processing unit. The memory includes a first logical block and a second logical block. The first logical block includes multiple logical pages, where two logical pages in the logical pages have a first logical address and a second logical address; the first logical block is configured to store a first mapping relationship, where the first mapping relationship provides a mapping relationship between the first logical address and a first physical address, and the first mapping relationship provides a mapping relationship between the foregoing second logical address and a second physical address; the second logical block includes multiple logical pages, and one logical page in the logical pages has a third logical address. The hard disk includes multiple physical pages, where a first physical page, a second physical page, and a third physical page in the physical pages respectively have the first physical address, the second physical address, and the third physical address; the first physical page and the second physical page store a piece of same duplicated data; the two pieces of duplicated data respectively correspond to the first logical address and the second logical address; the processing unit is configured to execute a de-duplication command; when executing the de-duplication command, the processing unit configures the third logical address to be mapped to the third physical address, and stores the duplicated data in the third physical page; moreover, when updating the first mapping relationship, the processing unit makes the first logical address and the second logical address mapped to the third logical address synchronously, and the processing unit further stores a second mapping relationship in the second logical block, where the second mapping relationship provides a mapping relationship between the third logical address and the third physical address.
In some embodiments, the foregoing memory further includes a key value table; the processing unit is further configured to execute a write command and a read command, where the write command includes a piece of written data; when executing the write command, the processing unit does not add a key value of the written data into the key value table; the processing unit determines whether the key value of the written data exists in the key value table; when executing the read command, the processing unit does not determine whether a key value of the read data exists in the key value table, and the processing unit adds a key value of a piece of read data from the hard disk into the key value table.
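The asymmetric handling of the key value table described above can be sketched as follows. This is a minimal illustration only: the class name `KeyValueTable`, the method names, and the choice of SHA-256 as the key value are assumptions and are not taken from the embodiments.

```python
import hashlib


def key_of(data: bytes) -> str:
    # Key value of a piece of data (hash function is an assumption).
    return hashlib.sha256(data).hexdigest()


class KeyValueTable:
    """Hypothetical key value table populated on reads and consulted on writes."""

    def __init__(self):
        self.keys = set()

    def on_read(self, data: bytes) -> None:
        # Read command: the key value of the read data is added; no lookup is performed.
        self.keys.add(key_of(data))

    def on_write(self, data: bytes) -> bool:
        # Write command: the table is only consulted; nothing is added.
        return key_of(data) in self.keys


table = KeyValueTable()
first_write_hit = table.on_write(b"A")   # nothing has been read yet
table.on_read(b"A")                      # data read from the hard disk
second_write_hit = table.on_write(b"A")  # the same data is now written back
```

Because only read data contributes samples, the table grows with the amount of data read rather than with every write, which is one way a small table can still catch data that is read and then written again.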
In some embodiments, the foregoing hard disk further includes an operating system, and the hard disk has a file system compatible with the operating system; where the file system discriminates each piece of written data from an I/O apparatus into multiple file attributes and a file content; each file content is stored in the physical pages, and the processing unit executes the operating system to compare whether each file attribute of at least two pieces of written data in multiple pieces of written data is the same so as to selectively execute the de-duplication command.
In some embodiments, the foregoing file system provides a file indicator, where the file indicator provides a location correspondence between the file attributes and the file content of each piece of the written data; when each file attribute of the at least two pieces of written data in the multiple pieces of written data from the I/O apparatus is the same, the processing unit reads file contents of the foregoing at least two pieces of written data according to the file indicator, and according to the file contents of the foregoing at least two pieces of written data, the processing unit calculates and compares key values of two pieces of written data, so as to selectively execute the de-duplication command.
In some embodiments, when each file attribute of the at least two pieces of the foregoing written data in the multiple pieces of written data is the same, the operating system generates a process identifier (PID) that indicates the de-duplication command, so as to enable the processing unit to execute the de-duplication command.
In some embodiments, the foregoing operating system further generates another PID that indicates a data compression program, so that the processing unit further performs data compression on each piece of written data according to the other PID.
In some embodiments, after executing the de-duplication command, the foregoing processing unit is further configured to execute a garbage collection command; when executing the garbage collection command, the processing unit stores the written data in a fourth physical page in the physical pages; where the fourth physical page has a fourth physical address, and the processing unit updates the second mapping relationship so that the second mapping relationship provides a mapping relationship between the third logical address and the fourth physical address.
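The effect of garbage collection on the two mapping relationships can be sketched as below. All addresses are illustrative, the fourth physical address `10200` is hypothetical, and plain dictionaries stand in for the logical blocks and the hard disk.

```python
# Illustrative state after de-duplication; every address here is an example value.
first_mapping = {200: 80100, 400: 80100}  # first/second logical address -> third logical address
second_mapping = {80100: 10064}           # third logical address -> third physical address
disk = {10064: b"A"}                      # physical address -> page content


def read(logical_address):
    # Resolve a user-visible logical address through both mapping layers.
    return disk[second_mapping[first_mapping[logical_address]]]


def garbage_collect(shared_logical, new_physical):
    # Move the shared page and update only the second mapping relationship.
    old_physical = second_mapping[shared_logical]
    disk[new_physical] = disk.pop(old_physical)
    second_mapping[shared_logical] = new_physical


garbage_collect(80100, 10200)  # 10200 is a hypothetical fourth physical address
```

However many logical addresses share the duplicated data, moving the page changes a single entry in the second mapping relationship, and the first mapping relationship is left untouched.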
In some embodiments, after receiving multiple write commands sent, at different time points, by an I/O apparatus, the foregoing processing unit starts to execute the de-duplication command.
In some embodiments, the addresses in the foregoing first logical block can be accessed by a user, while the addresses in the foregoing second logical block cannot be accessed by the user, and the second logical block can only be used by a system when a data de-duplication program is performed; in other words, the first logical block is readable and writable, and the second logical block is read-only.
In some embodiments, the foregoing processing unit separately writes the duplicated data onto the first physical page and the second physical page; and the memory is a volatile memory; when a power source recovers power supply after supply interruption, the hard disk stores a correspondence between the duplicated data and the first logical address and the second logical address, and the processing unit re-establishes the mapping relationship between the first logical address and the first physical address and the mapping relationship between the second logical address and the second physical address according to the correspondence, and the first physical address and the second physical address where the duplicated data is stored separately.
In summary, according to an embodiment of the data storage system of the present invention, executing a de-duplication command offline by the processing unit and establishing a double-layer mapping relationship between duplicated data and physical addresses can reduce the number of times a mapping relationship between a logical address and a physical address is updated, which greatly simplifies the design of hardware and software, ensures the continuity of hard disk writing, and allows a location relationship of the duplicated data to be restored after power supply interruption. Further, performing the duplicated data determining program on the file layer can reduce the number of times the hard disk is read. Further still, meaningfully estimating the possibility of data duplication when executing a write command allows key value samples to be stored in a database with a small capacity, which reduces comparison time and maintains the accuracy of determining whether data is duplicated data.
The present invention will become more fully understood from the detailed description given herein below, which is for illustration only and thus is not limitative of the present invention, and wherein:
The hard disk 1 is used to store data, and includes multiple physical pages; each physical page has a physical address. As shown in
The memory 3 includes a first logical block 31 and a second logical block 32, where the first logical block 31 and the second logical block 32 respectively cover different address spaces in the memory 3 and each include multiple logical pages. For example, by using
The first logical block 31 is used to store a mapping relationship (for the convenience of description, called a first mapping relationship in the following). The first mapping relationship provides mapping between the logical addresses in the first logical block 31 and the physical addresses of the hard disk 1. As shown in
It should be noted that the addresses in the first logical block 31 can be accessed by a user, while the second logical block 32 can only be used by a system when the data de-duplication program is performed, and cannot be accessed by the user. On this basis, when the user operates an I/O apparatus to generate a write request, the processing unit 2 executes multiple write commands from the I/O apparatus, where each write command includes written data and a logical address corresponding to the written data. For example, the written data may be “B”, “A”, “C”, and “A”, which respectively correspond to the logical addresses “100”, “200”, “300”, and “400” in the first logical block 31. Although the written data “A” is duplicated data, the processing unit 2 does not immediately execute a de-duplication command; the processing unit 2 respectively writes the written data “B”, “A”, “C”, and “A” onto the physical pages 11, 12, 13, and 14 according to the first mapping relationship. After writing the four pieces of written data into the hard disk 1, the processing unit 2 executes other write commands from the I/O apparatus; the processing unit 2 also writes other written data onto other physical pages in the hard disk 1 according to the first mapping relationship. According to
When the processing unit 2 executes the de-duplication command, the processing unit 2 configures a corresponding quantity of logical addresses in the second logical block 32 according to the quantity of duplicated data in the hard disk 1, and configures these logical addresses to be mapped one-to-one to physical addresses in the hard disk 1. For example, if the processing unit 2 is to perform de-duplication on four pieces of duplicated data in the hard disk 1, the processing unit 2 can configure four logical addresses in the second logical block 32, each of which is mapped one-to-one to a physical address. Taking the written data “A” as an example of a piece of duplicated data, in the data de-duplication program the processing unit 2 can configure a logical address (called a third logical address) of the second logical block 32, for example, “80100”, which is mapped to the physical address “10064”. Next, as shown in
Similarly, if another piece of duplicated data is stored in other physical pages in the hard disk 1 and corresponds to multiple logical addresses in the first logical block 31, for example, three logical addresses (called a fourth logical address, a fifth logical address, and a sixth logical address in the following), then the processing unit 2 configures another logical address (called a seventh logical address in the following) in the second logical block 32 to be mapped one-to-one to a physical address, for example, “10164”, in the hard disk 1; stores the other piece of duplicated data in a physical page 17, of which the physical address is “10164”; updates the first mapping relationship; and adds the one-to-one mapping relationship between the seventh logical address and the physical address “10164” into the second mapping relationship, so that the fourth logical address, the fifth logical address, and the sixth logical address are mapped to the physical address “10164” by means of the seventh logical address in the second logical block 32. In this way, the processing unit 2 can repeat the foregoing steps to transfer all duplicated data in the hard disk 1.
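The de-duplication steps described above can be sketched as follows, using the example in which the duplicated data “A” corresponds to the logical addresses “200” and “400” and is transferred to the physical address “10064” behind the logical address “80100”. The function names, the use of dictionaries for the two logical blocks and the hard disk, and the assumption that second-block logical addresses never collide with physical addresses are illustrative only.

```python
def deduplicate(first_mapping, second_mapping, disk, user_addrs, shared_logical, shared_physical):
    """Fold the duplicated pages behind user_addrs into one shared physical page."""
    old_pages = [first_mapping[a] for a in user_addrs]
    disk[shared_physical] = disk[old_pages[0]]        # store one copy in the new physical page
    second_mapping[shared_logical] = shared_physical  # second mapping relationship
    for a in user_addrs:
        first_mapping[a] = shared_logical             # first mapping now points into block 32
    for p in old_pages:
        del disk[p]                                   # release the redundant copies


def resolve(first_mapping, second_mapping, logical):
    # Assumption: a target found in second_mapping is a second-block logical address;
    # any other target is already a physical address.
    target = first_mapping[logical]
    return second_mapping.get(target, target)


first_mapping = {100: 10000, 200: 10008, 300: 10016, 400: 10024}
second_mapping = {}
disk = {10000: b"B", 10008: b"A", 10016: b"C", 10024: b"A"}

deduplicate(first_mapping, second_mapping, disk, [200, 400], 80100, 10064)
```

After the call, both `resolve(first_mapping, second_mapping, 200)` and `resolve(first_mapping, second_mapping, 400)` lead to the single page at `10064`, while the addresses that hold unique data still resolve directly.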
After the processing unit 2 executes the write commands, the two pieces of duplicated data “A” are respectively written onto the physical pages 12 and 14. In some implementation manners, the memory 3 is a volatile memory; if power supply interruption of a power source occurs, data in the memory 3 disappears and the first mapping relationship stored in the first logical block 31 is lost. After the power source recovers and the data storage system is powered again, the original first mapping relationship can be re-established according to the location in the hard disk 1 where each piece of written data is stored and a correspondence between each piece of written data and a logical address. For example, the hard disk 1 stores correspondences between the written data “B”, “A”, “C”, and “A” and the logical addresses “100”, “200”, “300”, and “400”. Therefore, a mapping relationship between the logical address “100” and the physical address “10000” can be re-established because the written data “B” is stored in the physical page 11; a mapping relationship between the logical address “200” and the physical address “10008” can be re-established because the written data “A” is stored in the physical page 12; a mapping relationship between the logical address “300” and the physical address “10016” can be re-established because the written data “C” is stored in the physical page 13; and a mapping relationship between the logical address “400” and the physical address “10024” can be re-established because the written data “A” is stored in the physical page 14. Thus, as compared with the prior art, if power supply interruption occurs after the written data is stored in the physical pages, the mapping relationships between the logical addresses and the physical addresses of the duplicated data can still be re-established after power supply recovery.
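The recovery described above can be sketched as a scan over the physical pages. How the hard disk 1 records the correspondence between written data and logical addresses is not specified here; this sketch assumes, purely for illustration, that each page carries its logical address as metadata, and the names `PhysicalPage` and `rebuild_first_mapping` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PhysicalPage:
    address: int          # physical address of the page
    data: bytes           # written data stored in the page
    logical_address: int  # correspondence kept on disk (assumed per-page metadata)


def rebuild_first_mapping(pages):
    # Re-establish logical -> physical mappings from what survives on the hard disk.
    return {page.logical_address: page.address for page in pages}


pages = [
    PhysicalPage(10000, b"B", 100),  # physical page 11
    PhysicalPage(10008, b"A", 200),  # physical page 12
    PhysicalPage(10016, b"C", 300),  # physical page 13
    PhysicalPage(10024, b"A", 400),  # physical page 14
]
recovered = rebuild_first_mapping(pages)
```

Because each copy of the duplicated data “A” still occupies its own physical page at this stage, both of its logical addresses can be recovered without any separate duplicate relationship table.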
In some implementation manners, the processing unit 2 starts to determine whether data is duplicated data and to transfer the duplicated data only after the multiple write commands sent by the I/O apparatus at different time points have been executed (that is, the processing unit 2 executes the de-duplication command in a post-processing manner). Moreover, after the mapping relationships between the logical addresses and the physical addresses of the duplicated data are stored in the memory 3, the processing unit 2 starts to delete the written data “A” in the physical pages 12 and 14 and keeps the written data “A” in the physical page 16. To guard against power supply interruption occurring after the data de-duplication is performed, in this implementation manner the memory 3 may be a non-volatile memory, so as to retain the mapping relationships between the logical addresses and the physical addresses of the duplicated data.
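A post-processing pass of this kind might begin by grouping the logical addresses whose pages hold identical data, once all pending write commands have completed. The sketch below is one simple way to do that and is not presented as the method of the embodiments: it hashes every page with SHA-256, and the function name `find_duplicate_groups` is hypothetical.

```python
import hashlib
from collections import defaultdict


def find_duplicate_groups(first_mapping, disk):
    """Group logical addresses whose physical pages hold identical data."""
    groups = defaultdict(list)
    for logical, physical in sorted(first_mapping.items()):
        key = hashlib.sha256(disk[physical]).hexdigest()  # key value (assumed hash)
        groups[key].append(logical)
    # Only groups with more than one logical address contain duplicated data.
    return [addrs for addrs in groups.values() if len(addrs) > 1]


first_mapping = {100: 10000, 200: 10008, 300: 10016, 400: 10024}
disk = {10000: b"B", 10008: b"A", 10016: b"C", 10024: b"A"}
duplicate_groups = find_duplicate_groups(first_mapping, disk)
```

Each resulting group is a candidate for one logical address in the second logical block 32; running the pass after the writes keeps this work off the write path.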
As shown in
For example, if file attributes of two pieces of written data are different, it indicates that the two pieces of written data are not duplicated data, and then the processing unit 2 does not execute the de-duplication command; otherwise, if the file name, file size, and establishment time of the two pieces of written data are all consistent, it indicates a great possibility that duplicated data is included in the two pieces of written data; in this case, the processing unit 2 may further compare key values of the two pieces of written data without having to compare the key values of the two pieces of written data with all sample values. If the comparison result indicates that the key values of the two pieces of written data are the same, the processing unit 2 executes the de-duplication command.
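The two-stage check described above can be sketched as follows. The attribute fields mirror the file name, file size, and establishment time mentioned in the text; the names `FileAttributes` and `should_deduplicate`, the reader callbacks, and the use of SHA-256 as the key value are assumptions for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class FileAttributes:
    name: str
    size: int
    created: str  # establishment time


def should_deduplicate(attrs_a: FileAttributes, attrs_b: FileAttributes,
                       read_a: Callable[[], bytes], read_b: Callable[[], bytes]) -> bool:
    if attrs_a != attrs_b:
        return False  # attributes differ: no file content is read, no key value is computed
    # Attributes match: read both file contents and compare only these two key values.
    return hashlib.sha256(read_a()).digest() == hashlib.sha256(read_b()).digest()


report = FileAttributes("report.txt", 4, "2016-08-09T10:00")
other = FileAttributes("notes.txt", 4, "2016-08-09T10:00")

same_file = should_deduplicate(report, report, lambda: b"data", lambda: b"data")
same_attrs_only = should_deduplicate(report, report, lambda: b"data", lambda: b"datb")
different = should_deduplicate(report, other, lambda: b"data", lambda: b"data")
```

Passing the file contents as callbacks makes the saving visible: when the attributes differ, neither callback is invoked, so the hard disk is not read for that pair.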
Further, the first area in the hard disk 1 may further store a file indicator, which provides a location correspondence between each file attribute and each file content. When file attributes of two pieces of written data are the same, the processing unit 2 reads, from the physical pages, the file contents of the two pieces of written data according to the file indicator, and determines from the two file contents whether the key values of the two pieces of written data are the same, so as to selectively execute the de-duplication command.
In some other implementation manners, when file attributes of two pieces of written data are the same, the operating system 19 can generate a PID that indicates the de-duplication command, so as to enable the processing unit 2 to execute the de-duplication command according to the PID. In some other embodiments, the PID can be also applied to a data compression technology, that is, when a condition for performing data compression is satisfied, the operating system 19 can generate a PID that indicates a data compression program, so as to enable the processing unit 2 to perform data compression on the written data in the hard disk 1 according to the PID.
In some implementation manners, after the processing unit 2 reads file contents of two pieces of written data from the hard disk 1, the processing unit 2 separately cuts the two file contents. As shown in
In summary, according to an embodiment of the data storage system of the present invention, executing a de-duplication command offline by the processing unit and establishing a double-layer mapping relationship between duplicated data and physical addresses can reduce the number of times a mapping relationship between a logical address and a physical address is updated, which greatly simplifies the design of hardware and software, ensures the continuity of hard disk writing, and allows a location relationship of the duplicated data to be restored after power supply interruption. Further, performing the duplicated data determining program on the file layer can reduce the number of times the hard disk is read. Further still, meaningfully estimating the possibility of data duplication when executing a write command allows key value samples to be stored in a database with a small capacity, which reduces comparison time and maintains the accuracy of determining whether data is duplicated data.
Although the present invention has been described in considerable detail with reference to certain preferred embodiments thereof, the disclosure is not for limiting the scope of the invention. Persons having ordinary skill in the art may make various modifications and changes without departing from the scope and spirit of the invention. Therefore, the scope of the appended claims should not be limited to the description of the preferred embodiments described above.
Number | Date | Country | Kind |
---|---|---|---|
2016 1 0643692 | Aug 2016 | CN | national |

Number | Name | Date | Kind |
---|---|---|---|
5737742 | Achiwa | Apr 1998 | A |
9514146 | Wallace | Dec 2016 | B1 |
20090234870 | Bates | Sep 2009 | A1 |
20100042790 | Mondal | Feb 2010 | A1 |
20100217952 | Iyer | Aug 2010 | A1 |
20110296087 | Kim | Dec 2011 | A1 |
20160179395 | Fisher | Jun 2016 | A1 |

Number | Date | Country |
---|---|---|
20180046394 A1 | Feb 2018 | US |