The present invention relates to an apparatus and method for eliminating duplication of a file in a distributed storage system (DSS), and more specifically, to an apparatus and method for examining duplication of an active file and eliminating duplication of the file using a hash algorithm, bit level comparison and the like in the process of operating a distributed storage system.
A distributed storage system or a parallel storage system is a storage system which virtualizes a plurality of storage devices as one storage device. Rather than storing one file in one storage device, such a distributed storage system duplicates the file and stores and uses it across a plurality of virtualized storage devices in a distributed manner.
Just as an existing Redundant Array of Inexpensive Disks (RAID) storage device integrates a plurality of hard disks into one storage device to construct a larger, faster and more stable storage device, the distributed storage system may provide the functions of a larger, faster and more stable storage system by configuring a plurality of storage devices into one storage device.
Such a distributed storage system technique is used as a core technique in cloud computing or the like. As the number of storage devices configuring the distributed storage system increases, the capacity and performance of the distributed storage system are proportionally enhanced, and cost-effectiveness in terms of Total Cost of Ownership (TCO) is maximized. Therefore, the distributed storage system may provide high-level performance and expandability which cannot be provided by existing storage systems.
In relation to this,
Referring to
Meanwhile, in such a distributed storage system, a plurality of storage servers is divided into operation servers and backup servers in order to efficiently manage files, and currently operating active files (data or contents) are stored in the operation servers having a good performance, whereas backup files which do not operate currently are stored in the backup servers having a somewhat low performance, and thus limited storage media can be used efficiently.
However, since a file management method according to a conventional technique does not examine duplication of a file in a real operation system, duplicated files are stored and operated in the operation servers as they are, and storage and system expansions are needed because of the duplicated files. Accordingly, system installation cost is increased, and manpower and cost needed for operating the system are also increased.
When the distributed storage system is associated with systems for backup, Information Lifecycle Management (ILM), remote synchronization, mirror, archive, replication or the like, duplicated files are moved, and thus storage space and network resources of an individual system are wasted.
Therefore, the present invention has been made in view of the above problems, and it is an object of the present invention to provide an apparatus and method for examining duplication of an active file and eliminating duplication of the file using a hash algorithm, bit level comparison and the like in a distributed storage system.
Another object of the present invention is to provide an apparatus and method for eliminating duplication of a file, in which unnecessary storage and system expansions required due to duplicated files are prevented by eliminating the duplicated files (data or contents) in the process of operating a system.
Still another object of the present invention is to provide an apparatus and method for eliminating duplication of a file, in which duplicated files are not transmitted when the distributed storage system is associated with systems for backup, Information Lifecycle Management (ILM), remote synchronization, mirror, archive, replication or the like, and thus unnecessary storage expansion and waste of network resources are prevented in an individual system.
Still another object of the present invention is to provide an apparatus and method which can support various types of hash algorithms when duplication of a file is examined and eliminated in a distributed storage system, examine and eliminate duplication of a file by the unit of file and/or chunk, and examine and eliminate duplication of a file for the whole system, for each volume or for each associated system.
Still another object of the present invention is to provide a distributed storage system efficiently using the apparatus and method for eliminating duplication of a file described above.
To accomplish the above objects, according to one aspect of the present invention, there is provided a file duplication examination apparatus of a distributed storage system, the apparatus including: a fingerprinting unit for calculating a hash value of each chunk for an active file and calculating a secondary hash value by adding the hash values calculated for respective chunks; a duplication examination unit for examining duplication of the file using the hash value of each chunk and the secondary hash value; and a duplicate file elimination unit for eliminating a duplicated file depending on a result of the examination.
According to another aspect of the present invention, there is provided a distributed storage system including: a plurality of storage servers for storing a file in a distributed manner; and a metadata server for managing metadata of the file, wherein the metadata server calculates a hash value of each chunk for an active file, calculates a secondary hash value by adding the hash values calculated for respective chunks, examines duplication of the file using the hash value of each chunk and the secondary hash value, and eliminates a duplicated file depending on a result of the examination.
According to one aspect of the present invention, there is provided a file duplication examination method of a distributed storage system, the method including the steps of: calculating a hash value of each chunk for an active file; calculating a secondary hash value by adding the hash values calculated for respective chunks; examining duplication of the file using the hash value of each chunk and the secondary hash value; and eliminating a duplicated file depending on a result of the examination.
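By way of illustration only, the two calculating steps above may be sketched as follows. The sketch is not the claimed implementation. The chunk size, the function names and the choice of SHA-1 are hypothetical, and "adding" the chunk hash values is read here as concatenating the chunk digests in order and hashing the result, which is only one possible interpretation.

```python
import hashlib
from typing import List, Tuple

CHUNK_SIZE = 4  # hypothetical chunk size in bytes, kept tiny for illustration


def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE,
                 algorithm: str = "sha1") -> List[str]:
    """Return the hex digest of each fixed-size chunk of the file data."""
    return [
        hashlib.new(algorithm, data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def secondary_hash(hashes: List[str], algorithm: str = "sha1") -> str:
    """Combine the chunk digests into one file unit value.

    Assumption: the chunk digests are concatenated in order and hashed again;
    the description does not fix the exact combining operation.
    """
    return hashlib.new(algorithm, "".join(hashes).encode("ascii")).hexdigest()


def fingerprint(data: bytes) -> Tuple[List[str], str]:
    """Return (chunk unit hash values, file unit hash value) for the data."""
    hashes = chunk_hashes(data)
    return hashes, secondary_hash(hashes)
```

Under these assumptions, two files with identical content yield the same file unit hash value, while files that differ in only one chunk still share the hash values of their unchanged chunks.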
According to the present invention, files can be managed efficiently by examining and eliminating duplication of active files using a hash algorithm, an algorithm of its own and the like in a distributed storage system.
According to the present invention, unnecessary storage and system expansions required due to duplicated files are prevented by eliminating duplicated files (data or contents) in the process of operating a system, and thus system installation cost, as well as manpower and cost needed for operating the system, is saved.
In addition, according to the present invention, duplicated files (data or contents) are not transmitted, because duplication of files is examined in a real operation system when the distributed storage system is associated with systems for backup, Information Lifecycle Management (ILM), remote synchronization, mirror, archive, replication or the like, and thus waste of storage space and network resources of an individual system can be prevented.
The preferred embodiments of the present invention will be hereafter described in detail, with reference to the accompanying drawings. Furthermore, in the drawings illustrating the embodiments of the present invention, elements having like functions will be denoted by like reference numerals and details thereon will not be repeated.
First,
Referring to
Referring to
Describing additionally, the file duplication elimination apparatus according to the present invention is configured as a separate apparatus or server in a distributed storage system (refer to
In relation to this,
In addition,
Meanwhile,
Hereinafter, an apparatus and method for eliminating duplication of a file in a distributed storage system according to the present invention will be described with reference to
First, referring to
For example, the fingerprinting unit 241 and 321 calculates a hash value by the unit of chunk for a currently operating active file using a certain hash algorithm (MD2, MD4, MD5, SHA, SHA-1, RIPEMD160, or DSS-1) (refer to S610 of
In relation to step S630, according to a preferred embodiment of the present invention, the hash value of a chunk unit is included in the chunk header and the metadata payload, and the hash value of a file unit (secondary hash value) is included in the metadata header. Specifically, the file duplication elimination apparatus according to the present invention calculates a hash value of a chunk unit and a hash value of a file unit and transmits the calculated hash values to the metadata server, and the metadata server creates or updates metadata of a corresponding file by including the file unit hash value in the metadata header and the chunk unit hash value in the metadata payload.
In addition, according to a preferred embodiment of the present invention, the chunk unit hash value and the file unit hash value are stored in memory and the database in the form of a hash value management table. Specifically, a chunk unit hash value management table is stored in the memory of an individual storage server (individual operation server) storing corresponding chunks, and a file unit hash value management table is stored in the memory of the file duplication elimination apparatus (file duplication elimination server). In addition, the chunk unit hash value management table and/or the file unit hash value management table are stored in a database, and here, the database may be provided within the file duplication elimination apparatus (file duplication elimination server) according to the present invention or provided in the form of a separate database server. Since the present invention is implemented in this manner, a hash value of a file and/or a chunk does not need to be detected every time, and particularly, the hash values do not need to be detected again in a situation where restoration is needed, such as restart of the file duplication elimination apparatus (file duplication elimination server), restart of an individual storage server (individual operation server), or reinstallation of a database.
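As a minimal sketch of such a hash value management table, the following keeps the entries in memory and mirrors them into a database so that they can be reloaded after a restart instead of being calculated again. The class name, the table layout and the use of SQLite are assumptions made for illustration; the description does not name a particular database.

```python
import sqlite3
from typing import Optional


class HashTable:
    """In-memory hash value management table mirrored into a SQLite table."""

    def __init__(self, conn: sqlite3.Connection, name: str):
        # The table name is interpolated into SQL, so accept plain identifiers only.
        if not name.isidentifier():
            raise ValueError("table name must be a plain identifier")
        self.conn = conn
        self.name = name
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {name} "
            "(hash TEXT PRIMARY KEY, location TEXT NOT NULL)"
        )
        # Reload persisted entries so hash values survive a restart.
        self.entries = dict(conn.execute(f"SELECT hash, location FROM {name}"))

    def put(self, hash_value: str, location: str) -> None:
        """Record a hash value in memory and in the database."""
        self.entries[hash_value] = location
        self.conn.execute(
            f"INSERT OR REPLACE INTO {self.name} (hash, location) VALUES (?, ?)",
            (hash_value, location),
        )
        self.conn.commit()

    def get(self, hash_value: str) -> Optional[str]:
        """Look up a hash value in memory only; return None if it is unknown."""
        return self.entries.get(hash_value)
```

In this sketch, constructing a second HashTable on the same database plays the role of a restarted server: its in-memory entries are repopulated from the stored rows.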
Meanwhile, the duplication examination unit 242 and 322 of the file duplication elimination apparatus according to the present invention examines duplication of a currently operating file with reference to the hash management table described above.
For example, the duplication examination unit 242 and 322 performs a primary duplication examination on an operating file by reviewing duplication, referring to the file unit hash value management table and/or the chunk unit hash value management table based on file unit hash value and/or the chunk unit hash value (refer to S710 of
If the file is determined as being duplicated as a result of the examination performed by the duplication examination unit 242 and 322, the duplicate file elimination unit 243 and 323 of the file management apparatus according to the present invention eliminates relevant files (refer to S730 of
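The examination and elimination flow described above may be sketched as follows, purely as an example. It assumes that the bit level comparison mentioned earlier serves as a confirmation step after a hash value match. The function names are hypothetical, plain dictionaries stand in for the file unit hash value management table and the storage servers, and a single SHA-1 digest of the whole file stands in for the file unit hash value.

```python
import hashlib
from typing import Dict


def file_hash(data: bytes) -> str:
    """Stand-in for the file unit hash value (assumption: SHA-1 of the whole file)."""
    return hashlib.sha1(data).hexdigest()


def store_file(path: str, data: bytes,
               file_table: Dict[str, str], store: Dict[str, bytes]) -> str:
    """Store data unless an identical file exists; return the path holding the content.

    file_table maps a file unit hash value to the path of the stored file, and
    store maps a path to file content (a stand-in for the storage servers).
    """
    digest = file_hash(data)
    candidate = file_table.get(digest)  # primary examination: table lookup
    # Secondary examination: a byte-for-byte comparison confirms the match,
    # which guards against two different files sharing one hash value.
    if candidate is not None and store[candidate] == data:
        return candidate  # duplicate: nothing new is written
    store[path] = data
    file_table.setdefault(digest, path)
    return path
```

For instance, storing the same content under two paths leaves a single stored copy, and the second call returns the path of the first.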
In relation to duplication examination and elimination of a file, according to a preferred embodiment of the present invention, duplication examination and elimination by the unit of file may be performed by the file duplication elimination apparatus (file duplication elimination server) (refer to
Meanwhile, elimination of a duplicated file may be elimination of a file or a chunk itself, or elimination of the duplicated file can be performed by creating, modifying and deleting a chunk unit pointer for the file. For example, in the case of a file creation process, if a file is duplicated as a result of performing duplication examination on the file, a chunk unit pointer of the file is modified, and the file is deleted. In the case of file deletion process, only the chunk unit pointer of the file is deleted, and in the case of file copy process, only a chunk unit pointer of the file is created.
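The pointer-based handling of file creation, deletion and copy may be illustrated by the following sketch. The class and method names are hypothetical. The reference count that reclaims a chunk once no pointer refers to it is an assumption added for the example, since the description states only that the pointer is deleted.

```python
import hashlib
from typing import Dict, List


class ChunkStore:
    """Toy store in which a file is a list of chunk unit pointers (chunk hashes)."""

    def __init__(self, chunk_size: int = 4):
        self.chunk_size = chunk_size            # hypothetical, tiny for illustration
        self.chunks: Dict[str, bytes] = {}      # chunk hash -> chunk content
        self.refs: Dict[str, int] = {}          # chunk hash -> number of pointers
        self.files: Dict[str, List[str]] = {}   # file name -> chunk unit pointers

    def create(self, name: str, data: bytes) -> None:
        """Create a file; duplicated chunks are not stored again, only pointed to."""
        if name in self.files:
            raise ValueError("file already exists")
        pointers = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha1(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)
            self.refs[digest] = self.refs.get(digest, 0) + 1
            pointers.append(digest)
        self.files[name] = pointers

    def copy(self, src: str, dst: str) -> None:
        """Copy a file by creating pointers only; no chunk content is duplicated."""
        if dst in self.files:
            raise ValueError("destination already exists")
        self.files[dst] = list(self.files[src])
        for digest in self.files[dst]:
            self.refs[digest] += 1

    def delete(self, name: str) -> None:
        """Delete a file's pointers; reclaim a chunk only when none remain (assumption)."""
        for digest in self.files.pop(name):
            self.refs[digest] -= 1
            if self.refs[digest] == 0:
                del self.refs[digest]
                del self.chunks[digest]

    def read(self, name: str) -> bytes:
        """Reassemble a file by following its chunk unit pointers."""
        return b"".join(self.chunks[digest] for digest in self.files[name])
```

In this sketch, creating two identical files and copying one of them still leaves each distinct chunk stored once, and the content remains readable until the last file pointing to it is deleted.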
Finally, referring to
In short, the metadata management unit 324 creates and manages metadata of the files stored in a plurality of storage servers (operation servers and backup servers) in a distributed manner, and the storage device management unit 325 manages information on performance and capacity of the plurality of storage servers. Accordingly, the file duplication elimination apparatus according to the present invention may manage the files more efficiently in association with the metadata management unit 324 and/or the storage device management unit 325.
Meanwhile, the method of eliminating duplication of a file in a distributed storage system according to the present invention may be embodied through a computer readable recording medium containing program commands for performing operations implemented in a variety of computers. The computer readable medium may include program commands, data files, data structures and the like in a single or combined form. The recording medium may be a medium that is specially designed and configured for the present invention or a medium that is publicly known and available to those skilled in the computer software art. Examples of the computer readable medium include magnetic media such as a hard disk, a floppy disk and a magnetic tape, optical media such as a CD-ROM and a DVD, magneto-optical media such as a floptical disk, and hardware devices specially configured to store and execute the program commands, such as ROM, RAM and flash memory. Examples of the program commands include high-level language codes that can be executed by a computer using an interpreter or the like, as well as machine codes such as those generated by a compiler.
While the present invention has been described with reference to the particular illustrative embodiments, it is not to be restricted by the embodiments but only by the appended claims. It is to be appreciated that those skilled in the art can change or modify the embodiments without departing from the scope and spirit of the present invention.
| Number | Date | Country | Kind |
|---|---|---|---|
| 10-2009-0113516 | Nov 2009 | KR | national |

| Filing Document | Filing Date | Country | Kind | 371c Date |
|---|---|---|---|---|
| PCT/KR2010/007764 | 11/4/2010 | WO | 00 | 4/3/2012 |