This application claims priority under 35 U.S.C. §119 to Korean Patent Application No. 10-2012-0009067 filed on Jan. 30, 2012, the subject matter of which is hereby incorporated by reference.
The inventive concept relates generally to electronic memory technologies. More particularly, the inventive concept relates to techniques for performing data deduplication in memory systems.
Data deduplication is a technique that reduces the amount of occupied storage space in a memory device or system by eliminating redundant data. As an example, a mail server may perform data deduplication to eliminate redundant copies of an email attachment that has been sent to multiple accounts associated with the mail server. Data deduplication typically involves storing a single unique copy of a unit of data and replacing each redundant copy of the data with a pointer to the unique copy.
In many systems, data deduplication is performed in units referred to as “chunks”. For example, a system may divide input data into multiple chunks, determine whether any of the chunks are identical to each other or to data already stored in the system, and remove redundant chunks based on the determination.
One shortcoming of data deduplication is that it tends to increase the operating overhead of a system. In other words, processing time is required to perform data deduplication, which can reduce the overall performance of the system.
In one embodiment of the inventive concept, a system comprises a pre-processor that receives a data file and determines a type of the data file, a chunking module that chunks the data file to produce a plurality of chunks, a hash engine that generates a hash value for a chunk among the plurality of chunks, a finger print detector that determines whether the hash value matches an entry within a portion of an index table corresponding to the type of the data file, and a storage medium that stores the chunk or a pointer to the chunk according to a result of the determination performed by the finger print detector.
In another embodiment of the inventive concept, a method of performing data deduplication comprises determining a type of an input data file, and performing deduplication on the data file by a first method if the data file is of a first type, and performing deduplication of the data file by a second method if the data file is of a second type different from the first type.
In another embodiment of the inventive concept, a method of performing data deduplication comprises generating a plurality of chunks from an input data file using a first method or a second method according to a type of the input data file, determining whether a copy of a selected chunk among the plurality of chunks is already stored in a storage medium, and selectively storing the selected chunk in the storage medium according to a result of the determination.
These and other embodiments of the inventive concept can potentially perform data deduplication with greater efficiency compared with conventional technologies.
The drawings illustrate selected embodiments of the inventive concept. In the drawings, like reference numbers indicate like features.
Embodiments of the inventive concept are described below with reference to the accompanying drawings. These embodiments are presented as teaching examples and should not be construed to limit the scope of the inventive concept.
In the description that follows, the terms “a”, “an”, “the”, and similar referents shall encompass the singular and the plural forms, unless indicated to the contrary. Terms such as “comprising”, “having”, “including”, “containing”, etc., are to be construed as open-ended terms unless indicated to the contrary.
The terms first, second, etc. may be used herein to describe various features, but the described features are not to be limited by these terms. Rather, these terms are used merely to distinguish between different features. Accordingly, a first feature discussed below could be termed a second feature, and vice versa, without changing the meaning of the relevant description.
The term “module”, as used herein, refers to, but is not limited to, a software component, a hardware component, or a combination thereof, such as a Field Programmable Gate Array (FPGA) or Application Specific Integrated Circuit (ASIC), which performs certain tasks. A module may reside in an addressable storage medium and be executed on one or more processors. For example, a module may include software components, object-oriented software components, class components, task components, processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables. The functionality provided by these features may be combined into fewer components or separated into further components. In other words, the functionality defined by a module can be partitioned in fairly arbitrary ways between various hardware components, software components, etc.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art. The use of any and all examples, or example terms provided herein is intended merely to better illuminate the inventive concept and is not to limit the scope of the inventive concept. Further, unless indicated otherwise, terms defined in generally used dictionaries are not to be interpreted in an overly formal sense.
Referring to
Referring to
Chunking module 120 divides the data file into a plurality of chunks in a procedure referred to as “chunking”. Chunking module 120 performs chunking on the data file using one of a first method and a second method according to the type of data file. That is to say, chunking module 120 may employ a different chunking method according to the type of data file received. In some embodiments, the first method comprises content based chunking of the data file and the second method comprises offset based chunking in which the data file is chunked by a predetermined offset from a file starting point. In some embodiments, the first method comprises content defined chunking (CDC) and the second method comprises static chunking (SC).
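By way of illustration only, the two chunking methods can be sketched as follows. The function names, the default chunk sizes, and the simple window-sum hash are assumptions made for this sketch and are not taken from the described embodiments; practical CDC implementations typically use a stronger rolling hash.

```python
from typing import List


def static_chunks(data: bytes, size: int = 8192) -> List[bytes]:
    """Static chunking (SC): cut at fixed offsets from the file starting point."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data: bytes, window: int = 16, mask: int = 0x3F,
                           min_size: int = 32, max_size: int = 1024) -> List[bytes]:
    """Content defined chunking (CDC): cut where a hash of the most recent
    `window` bytes matches a bit pattern, so boundaries follow the content
    rather than the byte offset."""
    chunks: List[bytes] = []
    start = 0
    for i in range(len(data)):
        length = i - start + 1
        if length < min_size:
            continue  # enforce a minimum chunk size
        # Toy window hash (an assumption of this sketch): sum of the last bytes.
        h = sum(data[max(0, i - window + 1):i + 1])
        if (h & mask) == 0 or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks
```

With SC, inserting a single byte near the start of a file shifts every later boundary. With CDC, boundaries depend on local content, so later chunks tend to realign with previously stored chunks, at the cost of computing a hash at each byte position.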
Hash engine 130 generates a hash value for each chunk. In particular, hash engine 130 applies a predetermined hash function to each chunk chunked by chunking module 120 to generate a hash value of each chunk. The generated hash value for each chunk may be referred to as a finger print of the chunk.
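As a minimal sketch of finger print generation, the following assumes SHA-256 as the predetermined hash function. The description does not name a particular hash function, so both the choice of SHA-256 and the function name `fingerprint` are assumptions.

```python
import hashlib


def fingerprint(chunk: bytes) -> str:
    """Return a hex digest that serves as the finger print of a chunk.

    SHA-256 is an assumption of this sketch, not a requirement.
    """
    return hashlib.sha256(chunk).hexdigest()
```

Identical chunks always yield the same finger print, so comparing finger prints can stand in for comparing chunk contents.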
Finger print detector 140 determines whether the hash value of each chunk is already stored in a portion of an index table 150 corresponding to the type of the data file. For example, if the type of the data file received from pre-processor 110 is determined to be an A type (e.g., a doc file type), finger print detector 140 determines whether the hash value of each chunk of the input data file exists in a portion “A” of index table 150.
The portion of index table 150 corresponding to A type stores the hash values of the chunks of A type data files already stored in storage medium 12. Therefore, if the hash value of a target chunk is found in the portion of index table 150 corresponding to A type, the target chunk is already stored in storage medium 12 (e.g., from a previous storage operation). Accordingly, the target chunk is not redundantly stored to storage medium 12, and instead only a pointer to the target chunk is stored in storage medium 12. Where the hash value of the target chunk is not found in the portion of index table 150 corresponding to A type, the target chunk is not already stored in storage medium 12, and the data itself is stored to storage medium 12. In addition, the portion of index table 150 corresponding to A type is updated with the hash value of the target chunk.
Finger print detector 140 does not determine whether the hash value of each chunk is stored in other portions of index table 150 corresponding to different types of data files, such as B, C and D types (e.g., one of a jpg file type and an avi file type). In other words, finger print detector 140 inspects only a portion of index table 150 corresponding to the same type as the type of the input data file.
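The type-partitioned lookup described above can be sketched, under stated assumptions, as a dictionary of per-type dictionaries. The class name `TypedIndex`, the in-memory list standing in for storage medium 12, and the use of SHA-256 are all hypothetical choices made for illustration.

```python
import hashlib
from typing import Dict, List, Tuple


class TypedIndex:
    """Index table partitioned by file type (hypothetical sketch)."""

    def __init__(self) -> None:
        # file type -> {finger print -> location of the stored chunk}
        self.partitions: Dict[str, Dict[str, int]] = {}
        # Stands in for the storage medium; each entry is one unique chunk.
        self.storage: List[bytes] = []

    def store_chunk(self, file_type: str, chunk: bytes) -> Tuple[str, int]:
        fp = hashlib.sha256(chunk).hexdigest()
        # Only the partition matching the file type is inspected.
        part = self.partitions.setdefault(file_type, {})
        if fp in part:
            return ("pointer", part[fp])  # duplicate: store only a pointer
        self.storage.append(chunk)        # new chunk: store the data itself
        part[fp] = len(self.storage) - 1  # and update this partition
        return ("data", part[fp])
```

In this sketch, a chunk that is identical to one stored under a different file type is not detected as a duplicate. That is the trade-off of inspecting a single partition: each lookup searches a smaller table, but redundancy across file types goes unnoticed.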
Referring to
Next, it is determined whether the data file type is a type requiring CDC (S120). Where the data file type is a type requiring CDC (S120=Y), CDC is performed on the data file (S130). Otherwise (S120=N), SC is performed on the data file (S140). In some embodiments, chunking module 120 selects one of CDC and SC according to the data file type and chunks the data file.
Next, a finger print (hash value) of each chunk is generated and it is determined whether the finger print is stored in the index table (S150). If the finger print is stored in the index table, indicating that a corresponding target chunk comprises data that is pre-stored in storage medium 12 (S150=Y), a pointer indicating the data that is pre-stored in storage medium 12 is stored to storage medium 12 (S160). Otherwise (S150=N), the data is stored to storage medium 12 and index table 150 is updated (S170). In some embodiments, hash engine 130 generates a hash value of each chunk, and finger print detector 140 determines whether the hash value of each chunk exists in a portion of index table 150 corresponding to the data file type.
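The flow of operations S120 through S170 can be sketched end to end as follows. This is an illustrative sketch only: detecting the type from the file extension, the set `CDC_TYPES`, the small chunk sizes, the toy window hash, and SHA-256 are assumptions that do not come from the described embodiments.

```python
import hashlib
import os
from typing import Dict, List, Tuple

CDC_TYPES = {".doc"}  # hypothetical: which types require CDC is an assumption


def sc(data: bytes, size: int = 64) -> List[bytes]:
    """Static chunking at fixed offsets."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def cdc(data: bytes, window: int = 8, mask: int = 0x1F,
        min_size: int = 16, max_size: int = 256) -> List[bytes]:
    """Content defined chunking with a toy window-sum hash."""
    chunks, start = [], 0
    for i in range(len(data)):
        length = i - start + 1
        if length < min_size:
            continue
        h = sum(data[max(0, i - window + 1):i + 1])
        if (h & mask) == 0 or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def deduplicate(filename: str, data: bytes,
                index: Dict[str, Dict[str, int]],
                storage: List[bytes]) -> List[Tuple[str, int]]:
    """Return a list of ("data" | "pointer", location) entries for the file."""
    ftype = os.path.splitext(filename)[1].lower()       # type determination
    chunks = cdc(data) if ftype in CDC_TYPES else sc(data)  # S120 to S140
    part = index.setdefault(ftype, {})                  # partition for this type
    recipe = []
    for chunk in chunks:
        fp = hashlib.sha256(chunk).hexdigest()
        if fp in part:                                  # S150 = Y
            recipe.append(("pointer", part[fp]))        # S160
        else:                                           # S150 = N
            storage.append(chunk)                       # S170: store the data
            part[fp] = len(storage) - 1                 # S170: update the index
            recipe.append(("data", part[fp]))
    return recipe
```

A file can be reassembled by concatenating `storage[location]` for each entry of the returned list, whether the entry refers to newly stored data or to a pointer.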
In the embodiment of
Referring to
In addition, the percentage of redundant data for some file types may be considerably different based on the chunking method used. In particular, data redundancy for data file types B, C and D varies considerably when the chunking method changes from SC8 to CDC8, compared to the data file types A, E, F and G.
Based on the information shown in
In addition, where pre-processor 110 determines the input data file type as one of the file types B, C and D, adequate data deduplication can be performed simply by comparing input data with units of data of the same data file type. By performing data deduplication in this manner, the size of the portion of index table 150 that must be searched can be reduced, and less time is required to compare hash values, both of which can improve overall system performance.
Referring to
Referring to
Referring to
During typical operation, host device 20 stores the data file in temporary storage 14 of storage device 10, and data deduplication system 100 performs deduplication on the data file stored in temporary storage 14 when storage device 10 is in an idle state. As the result of the data deduplication, data without redundancy with respect to the data stored in storage medium 12 is newly stored in storage medium 12.
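The idle-time operation described above can be sketched as a device that buffers incoming files and deduplicates them later. The class and method names are hypothetical, fixed-size chunking and SHA-256 are assumed, and the per-type index partitioning is omitted to keep the sketch short.

```python
import hashlib
from collections import deque
from typing import Deque, Dict, List, Tuple


class StorageDevice:
    """Hypothetical sketch of deferred deduplication in a storage device."""

    def __init__(self, chunk_size: int = 4) -> None:
        self.temporary: Deque[Tuple[str, bytes]] = deque()  # temporary storage
        self.medium: List[bytes] = []          # storage medium (unique chunks)
        self.index: Dict[str, int] = {}        # finger print -> position
        self.files: Dict[str, List[int]] = {}  # file name -> chunk positions
        self.chunk_size = chunk_size

    def write(self, name: str, data: bytes) -> None:
        """Host write path: buffer the file without deduplicating it."""
        self.temporary.append((name, data))

    def on_idle(self) -> None:
        """Called when the device is idle: deduplicate buffered files."""
        while self.temporary:
            name, data = self.temporary.popleft()
            positions = []
            for i in range(0, len(data), self.chunk_size):
                chunk = data[i:i + self.chunk_size]
                fp = hashlib.sha256(chunk).hexdigest()
                if fp not in self.index:       # only non-redundant data is stored
                    self.medium.append(chunk)
                    self.index[fp] = len(self.medium) - 1
                positions.append(self.index[fp])
            self.files[name] = positions

    def read(self, name: str) -> bytes:
        return b"".join(self.medium[p] for p in self.files[name])
```

Deferring the work in this way keeps hashing and index lookups off the host write path, at the cost of temporarily holding data that has not yet been deduplicated.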
Referring to
Controller 1200 is connected to a host device and a nonvolatile memory device 1100. In response to a request from the host, controller 1200 accesses nonvolatile memory device 1100. For example, controller 1200 is configured to control read, write, erase and background operations of nonvolatile memory device 1100. Controller 1200 is configured to provide interfacing between nonvolatile memory device 1100 and the host. Controller 1200 is configured to drive firmware for controlling nonvolatile memory device 1100.
Controller 1200 typically further comprises well known components such as a random access memory (RAM), a processing unit, a host interface, and a memory interface. The RAM may be used as at least one of an operation memory of the processing unit, a cache memory between nonvolatile memory device 1100 and the host, and a buffer memory between nonvolatile memory device 1100 and the host. The processing unit may control every operation of controller 1200.
The host interface implements a protocol to exchange data between the host and controller 1200. For example, controller 1200 may be configured to communicate with the host through one of various standard interface protocols such as Universal Serial Bus (USB), multimedia card (MMC), peripheral component interconnect (PCI), PCI express (PCI-E), advanced technology attachment (ATA), serial-ATA, parallel-ATA, small computer system interface (SCSI), enhanced small disk interface (ESDI), and integrated drive electronics (IDE). The memory interface of controller 1200 may interface with nonvolatile memory device 1100. For example, the memory interface may include a NAND interface or a NOR interface.
Memory system 1000 may further comprise an error correction block to detect and correct errors in data read from nonvolatile memory device 1100 using an error correction code (ECC). The error correction block may be provided as a component of controller 1200 or nonvolatile memory device 1100.
Controller 1200 and nonvolatile memory device 1100 can be integrated in one semiconductor device. In an example embodiment, controller 1200 and nonvolatile memory device 1100 are integrated in one semiconductor device to constitute a memory card, such as a PC card (PCMCIA), a compact flash card (CF), a smart media card (SM/SMC), a memory stick, a multimedia card (MMC, RS-MMC, MMCmicro), an SD card (SD, miniSD, microSD), or a universal flash storage device (UFS).
In some embodiments, controller 1200 and nonvolatile memory device 1100 are integrated in one semiconductor device to form a solid state disk/drive (SSD). The SSD may include a storage device configured to store data to a semiconductor memory. Where memory system 1000 is used as an SSD, an operation speed of the host connected to memory system 1000 may be improved significantly.
In some embodiments, memory system 1000 may be applied to one of a computer, a portable computer, an Ultra Mobile PC (UMPC), a workstation, a net-book, a Personal Digital Assistant (PDA), a web tablet, a wireless phone, a mobile phone, a smart phone, an e-book, a portable multimedia player (PMP), a portable game device, a navigation device, a black box, a digital camera, a 3-dimensional television, a digital audio recorder, a digital audio player, a digital picture recorder, a digital picture player, a digital video recorder, a digital video player, a device capable of transmitting/receiving data in a wireless environment, one of various electronic devices constituting a home network, one of various electronic devices constituting a computer network, one of various electronic devices constituting a telematics network, a radio-frequency identification (RFID) device, or one of various components constituting a computing system.
Nonvolatile memory device 1100 or memory system 1000 may be packaged using various package types or package configurations, such as Package on Package (PoP), Ball Grid Arrays (BGAs), Chip Scale Packages (CSPs), Plastic Leaded Chip Carrier (PLCC), Plastic Dual In-Line Package (PDIP), Die in Waffle Pack, Die in Wafer Form, Chip On Board (COB), Ceramic Dual In-Line Package (CERDIP), Plastic Metric Quad Flat Pack (MQFP), Thin Quad Flatpack (TQFP), Small Outline Integrated Circuit (SOIC), Shrink Small Outline Package (SSOP), Thin Small Outline Package (TSOP), System in Package (SIP), Multi Chip Package (MCP), Wafer-level Fabricated Package (WFP), or Wafer-Level Processed Stack Package (WSP).
Referring to
Referring to
Memory system 2000 is electrically connected to CPU 3100, RAM 3200, user interface 3300 and to power supply 3400 through a system bus 3500. The data supplied through user interface 3300 or the data processed by CPU 3100 is stored to memory system 2000. Nonvolatile memory device 2100 is connected to system bus 3500 through controller 2200. However, nonvolatile memory device 2100 may be directly connected to system bus 3500.
Although computing system 3000 is shown with memory system 200 of
The foregoing is illustrative of embodiments and is not to be construed as limiting thereof. Although a few embodiments have been described, those skilled in the art will readily appreciate that many modifications are possible in the embodiments without materially departing from the novel teachings and advantages of the inventive concept. Accordingly, all such modifications are intended to be included within the scope of the inventive concept as defined in the claims.
Number | Date | Country | Kind |
---|---|---|---|
10-2012-0009067 | Jan 2012 | KR | national |