Embodiments described herein generally relate to data deduplication, and more specifically, relate to determining optimal data size for a data deduplication operation.
Various techniques may be used to provide data deduplication. In general, data deduplication may refer to a process to eliminate duplicate copies of data stored in a computer system. For example, unique data blocks may be stored at a storage resource. As a subsequent data block is received to be stored at the storage resource, the data blocks currently stored at the storage resource may be compared with the subsequent data block. If there are no copies of the subsequent data block currently stored at the storage resource, then the subsequent data block may be stored at the storage resource. Otherwise, if one of the data blocks currently stored at the storage resource is a duplicate of the subsequent data block, then the subsequent data block may not be stored at the storage resource. Instead, a reference to the location in the storage resource of the currently stored data block that is the duplicate of the subsequent data block may be provided.
The disclosure will be understood more fully from the detailed description given below and from the accompanying drawings of various embodiments of the disclosure. The drawings, however, should not be taken to limit the disclosure to the specific embodiments, but are for explanation and understanding only.
Aspects of the present disclosure are directed to determining an optimal data size for a data deduplication operation. In general, the data deduplication operation may be used to store data at a storage system that may be represented by one or more storage devices. Examples of the storage devices may include, but are not limited to, solid-state drives, hard disk drives, etc. The storage system may thus include a group or cluster of storage devices.
The data deduplication operation may be performed as data objects (e.g., files) are received to be stored at the storage system (i.e., inline deduplication). Each data object may be divided or separated into multiple separate data blocks (i.e., chunks). Each of the data blocks for a data object may then be compared with other data blocks that are currently stored at the storage system. In some embodiments, to perform a faster comparison, instead of directly comparing data blocks, the data deduplication operation may compare hash values. For example, the data deduplication operation may perform a hash function on each of the data blocks to calculate a hash value for each of the data blocks that correspond to a data object. The hash value for a particular data block may then be compared with the hash values of other data blocks that are currently stored at the storage system. For example, the hash value for a data block of a received data object may be compared with hash values for data blocks that were previously received and stored at the storage system. If the hash value for the particular data block matches any hash value of a data block currently stored at the storage system, then the received data block may be considered a duplicate of another data block that is currently stored at the storage system. Instead of storing the received data block, a reference (e.g., a pointer) to the duplicate data block that has already been stored may be stored. Otherwise, if the hash value of the received data block does not match any hash values of data blocks currently stored, then the data block may be stored at the storage system along with its hash value, the latter for use in future comparisons against any subsequently received data block.
Thus, the data deduplication operation may be performed for each individual data block of the data object by comparing each individual data block (or its corresponding hash value) to another data block that is currently stored at the storage system.
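The hash-based comparison described above may be sketched as follows. This is a minimal illustration only, not a definitive implementation: the class name DedupStore, its methods, the in-memory dictionary standing in for the storage system, and the choice of SHA-256 as the hash function are all assumptions that do not appear in the disclosure.

```python
import hashlib


class DedupStore:
    """Toy in-memory stand-in for a storage system (hypothetical name)."""

    def __init__(self):
        # Maps hash value -> stored data block (assumed index layout).
        self.blocks = {}

    def write_object(self, data, block_size):
        """Divide data into blocks, deduplicate, and return the references."""
        refs = []
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            # SHA-256 is an assumed hash function; the text names none.
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:
                # No matching hash value: store the block with its hash
                # value for use in future comparisons.
                self.blocks[digest] = block
            # Duplicate or not, the object keeps a reference to the block.
            refs.append(digest)
        return refs

    def read_object(self, refs):
        """Rebuild a data object from its list of references."""
        return b"".join(self.blocks[ref] for ref in refs)


store = DedupStore()
first_refs = store.write_object(b"A" * 4096 + b"B" * 4096, 4096)
second_refs = store.write_object(b"A" * 4096 + b"C" * 4096, 4096)
```

In this sketch the second object shares its first block with the first object, so only three unique blocks are held in the store even though four blocks were written.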
The size of the data block used during the data deduplication operation may impact the performance of a storage system. For example, more duplicate data blocks may be identified when the data blocks used during the data deduplication operation are smaller, since a data object may then be separated or divided into a larger number of data blocks that are each compared with data blocks currently stored at the storage system. As an example, if a data block corresponded to a sentence and if the difference between two sentences is a single character (e.g., an added punctuation character), then a comparison of two such data blocks may not identify a duplicate data block (e.g., a duplicate sentence). However, if the size of a data block is a portion of a sentence, then different portions of the two sentences may be identified as being duplicates and only the portion of one of the sentences that includes the additional character may not be identified as a duplicate. As a result, more portions of the data object may be replaced with references to duplicate data blocks that are currently stored and fewer portions of the data object may need to be stored at the storage system. However, a smaller size of data blocks used during the data deduplication operation may increase the amount of time to perform the data deduplication operation, as more hash values may need to be generated and more comparisons between generated hash values and hash values of previously stored data blocks may need to be performed. In addition, the data object reference may need to include more pointers, which increases the representational size of the data object.
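The sentence example above may be illustrated with a short sketch. The helper names chunk_hashes and count_duplicates, the sample sentence, the block sizes of 64 and 8 bytes, and the use of SHA-256 are assumptions chosen only for illustration.

```python
import hashlib


def chunk_hashes(data, block_size):
    """Hash each fixed-size block of data (SHA-256 is an assumed choice)."""
    return [hashlib.sha256(data[i:i + block_size]).digest()
            for i in range(0, len(data), block_size)]


def count_duplicates(stored, incoming, block_size):
    """Count blocks of incoming whose hash matches a block of stored."""
    known = set(chunk_hashes(stored, block_size))
    return sum(1 for h in chunk_hashes(incoming, block_size) if h in known)


first = b"The quick brown fox jumps over the lazy dog"  # 43 bytes
second = first + b"!"  # same sentence with one added punctuation character

# One block per sentence: the added character defeats the comparison.
whole = count_duplicates(first, second, 64)
# Blocks that each cover a portion of the sentence: only the last differs.
parts = count_duplicates(first, second, 8)

# The smaller block size also means more references per data object.
pointers_whole = len(chunk_hashes(second, 64))
pointers_parts = len(chunk_hashes(second, 8))
```

Here the 64-byte block size finds no duplicate, while the 8-byte block size finds five duplicate blocks out of six, at the cost of six references instead of one. The sketch places the added character at the end of the sentence; with fixed-size blocks, a character inserted earlier would shift every later block boundary and reduce the number of matches.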
Aspects of the present disclosure may determine a size of a data block used in the data deduplication operation based on a workload of a received data object. The workload may correspond to a type of application that has generated or is used with the data object. Examples of a workload include, but are not limited to, types of word documents (e.g., text documents from particular word processing applications), types of databases (e.g., database files or snapshots), etc. The workload of the data object may be identified by any combination of inspecting the data object (e.g., parsing a portion of the data object contents), receiving an indication of an application that is associated with the data object (e.g., an application hint), a name of a file corresponding to the data object (e.g., the file extension), a size of the data object or a pattern of usage of the data object, etc. After the workload of the data object is identified, a size of a data block that is to be used during a data deduplication operation with the data object may be determined based on the identified workload. For example, a first data object of a first type of workload may be assigned a data block size of 4 kilobytes (KB) and the data deduplication operation for the first data object may be based on 4 KB data blocks of the data object. A second data object of a second type of workload may be assigned a different data block size of 8 kilobytes (KB) and the data deduplication operation for the second data object may thus be based on 8 KB data blocks of the second data object.
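One possible way to map an identified workload to a data block size is sketched below. The workload names, the extension and signature tables, the fallback block size, and the order in which the application hint, content signature, and file extension are consulted are all assumptions; only the 4 KB and 8 KB sizes echo the example above. The 16-byte SQLite header string is used as a sample content signature.

```python
import os

KB = 1024

# Hypothetical workload types; which workload receives which size is assumed.
BLOCK_SIZE_BY_WORKLOAD = {"text_document": 4 * KB, "database": 8 * KB}
DEFAULT_BLOCK_SIZE = 4 * KB  # assumed fallback for unrecognized workloads

# Illustrative lookup tables, not taken from the disclosure.
EXTENSION_TO_WORKLOAD = {
    ".txt": "text_document",
    ".docx": "text_document",
    ".db": "database",
    ".sqlite": "database",
}
SIGNATURE_TO_WORKLOAD = {b"SQLite format 3\x00": "database"}


def identify_workload(name, content, app_hint=None):
    """Return a workload type, or None if it cannot be identified."""
    # 1. An application hint, if it names a known workload.
    if app_hint in BLOCK_SIZE_BY_WORKLOAD:
        return app_hint
    # 2. A signature found by inspecting the start of the contents.
    for signature, workload in SIGNATURE_TO_WORKLOAD.items():
        if content.startswith(signature):
            return workload
    # 3. The file extension of the data object's name.
    extension = os.path.splitext(name)[1].lower()
    return EXTENSION_TO_WORKLOAD.get(extension)


def block_size_for(name, content, app_hint=None):
    """Pick the deduplication block size for a data object."""
    workload = identify_workload(name, content, app_hint)
    return BLOCK_SIZE_BY_WORKLOAD.get(workload, DEFAULT_BLOCK_SIZE)
```

For instance, block_size_for("notes.txt", b"hello") would select the 4 KB size in this sketch, while a data object whose contents begin with the SQLite header would select the 8 KB size regardless of its file name.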
As such, the determining of the size of a data block to be used in a data deduplication operation based on the workload associated with a data object may improve the efficiency of a storage system. For example, the optimal or preferred data block size for different workloads may be different since the data objects are of different types and formats for different applications. If the size of a data block used in a data deduplication operation is too small, then multiple calls to the hash function and multiple comparisons may be performed between the smaller data blocks and currently stored data blocks to identify duplicates. For example, if duplicate data blocks of a particular type of workload may be identified by 8 KB data blocks, then one hashing function and one comparison may be performed as opposed to two hashing functions and at least two comparisons being performed if the data deduplication operation were to be performed with 4 KB data blocks. Furthermore, additional storage resources may be used to store more hash values from the hash function when the size of the data block is decreased.
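The arithmetic behind the 4 KB versus 8 KB comparison above can be made explicit with a small sketch. The function name dedup_cost and the 32-byte hash value size (that of a SHA-256 digest) are assumptions, and the hash storage figure is an upper bound that applies when every block is unique.

```python
def dedup_cost(object_size, block_size, digest_size=32):
    """Estimate per-object deduplication work for a given block size."""
    # Ceiling division: a trailing partial block still needs its own hash.
    blocks = (object_size + block_size - 1) // block_size
    return {
        "hash_calls": blocks,                # one hash function call per block
        "lookups": blocks,                   # at least one comparison per block
        "hash_bytes": blocks * digest_size,  # hash storage if all blocks unique
    }


KB = 1024
cost_8k = dedup_cost(8 * KB, 8 * KB)  # one 8 KB block
cost_4k = dedup_cost(8 * KB, 4 * KB)  # two 4 KB blocks
```

Under these assumptions, 8 KB of data costs one hash function call with 8 KB blocks and two with 4 KB blocks, and the storage used for hash values doubles as well.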
Thus, the determining of the optimal size of the data block as described herein may reduce the storage capacity needed to store data objects, as data deduplication is still performed on the data blocks of the data objects, while improving storage system efficiency by reducing the number of retrievals of information (e.g., hash values) and comparison operations used during the data deduplication operation. Duplicate data may thus not be stored at the storage system, resulting in fewer write transactions being performed at the storage devices of the storage system, which effectively increases the storage capacity of the storage devices. The smaller number of write transactions may increase the lifespan or viability of a storage device (e.g., a solid-state drive) used in the storage system. Furthermore, the fewer hashing functions and comparison operations being performed may result in an increase in performance of the writing of a data object to the storage device used in the storage system by decreasing the amount of time to store the data object as fewer data deduplication operations are being performed.
As shown in
In some embodiments, the solid-state drive 120 may be a solid-state drive (SSD) or any other such storage device. The non-volatile memory 122.1 to 122.n may include one or more chips or dies that may individually include one or more types of non-volatile memory devices. In some embodiments, the non-volatile memory devices of the non-volatile memory may be embodied as planar or three-dimensional NAND (“3D NAND”) non-volatile memory devices or NOR. However, in other embodiments, the non-volatile memory may be embodied as any combination of memory devices that use chalcogenide phase change material (e.g., chalcogenide glass), three-dimensional (3D) crosspoint memory, or other types of byte-addressable, write-in-place non-volatile memory, ferroelectric transistor random-access memory (FeTRAM), nanowire-based non-volatile memory, phase change memory (PCM), memory that incorporates memristor technology, Magnetoresistive random-access memory (MRAM), Spin Transfer Torque (STT)-MRAM, silicon-oxide-nitride-oxide-silicon (SONOS) memory, polymer memory such as ferroelectric polymer memory, ovonic memory, nanowire or electrically erasable programmable read-only memory (EEPROM), etc. In the same or alternative embodiments, a memory device may be a block addressable memory device, such as those based on NAND or NOR technologies. A memory device may also include future generation nonvolatile devices, such as a three dimensional crosspoint memory device, or other byte addressable write-in-place nonvolatile memory devices. 
In one embodiment, the memory device may be or may include memory devices that use chalcogenide glass, multi-threshold level NAND flash memory, NOR flash memory, single or multi-level Phase Change Memory (PCM), a resistive memory, nanowire memory, ferroelectric transistor random access memory (FeTRAM), anti-ferroelectric memory, magnetoresistive random access memory (MRAM), memory that incorporates memristor technology, resistive memory including the metal oxide base, the oxygen vacancy base and the conductive bridge Random Access Memory (CB-RAM), or spin transfer torque (STT)-MRAM, a spintronic magnetic junction memory based device, a magnetic tunneling junction (MTJ) based device, a DW (Domain Wall) and SOT (Spin Orbit Transfer) based device, a thyristor based memory device, or a combination of any of the above, or other memory. The memory device may refer to the die itself and/or to a packaged memory product. As previously described, the solid-state drive 120 may be arranged or configured as a solid-state drive. The disclosure also applies to “Persistent Memory” and “Battery Backed DRAM”, and hard disk drives (HDDs). However, examples described in the present disclosure are not limited to storage devices arranged or configured as SSDs. Thus, the present disclosure is not limited to an SSD-based storage system.
Furthermore, the host computer 110 may include a data deduplication component 124 that determines a size of a data block to be used in a data deduplication operation for data objects to be stored at the solid-state drive 120. The data deduplication component 124 may be software, hardware (e.g., a separate integrated circuit), or a combination of software and hardware that is located externally to the solid-state drive 120. Further details with regard to the data deduplication component 124 are described in conjunction with
Although
As shown in
Referring to
As shown in
As shown in
As such, the size of a data block used in a data deduplication operation (e.g., a unit of deduplication) may be based on the type of workload of a data object. Furthermore, the data deduplication operation used by the storage system may change as different data objects are received. For example, the unit of deduplication used in the data deduplication operation may vary over time as different types of data objects of different types of workloads are received to be stored at the storage system.
As shown in
Referring to
As shown in
The method 500 may be performed for each data block of a data object. For example, the method 500 may be performed for each data block of the data object where each data block is of the size that corresponds to the workload of the data object. As a result, the method 500 may be repeatedly performed for each data block of a data object until a final data block has been subjected to the data deduplication operation.
As shown in
The memory buffer 612 may be implemented using a volatile static random access memory (SRAM), or any other volatile memory, for at least temporarily storing digital information (e.g., the data, computer-executable instructions, applications, etc.) as well as context information for the solid-state drive 602. Further, the processing device 614 may be configured to execute at least one program out of at least one memory to allow the memory arbiter 620 to direct the information from the memory buffer 612 to the solid-state memory within the non-volatile memory packages 608.1-608.n via the channels 622.1-622.n. Furthermore, via the I/O interface 605, the controller 610 may receive commands issued by the host computer 604 for writing or reading the data to and from the solid-state memory within the non-volatile memory packages 608.1-608.n.
The non-volatile memory packages 608.1-608.n may each include one or more non-volatile memory dies, in which each non-volatile memory die may include non-volatile memory (e.g., NAND flash memory) configured to store digital information or data in one or more arrays of memory cells organized into one or more pages. For example, the non-volatile memory package 608.1 may include one or more non-volatile memory dies. Each of the one or more non-volatile memory dies may be used or assigned to one logical unit so that block addresses of one logical unit are not distributed between two or more logical units. Although not illustrated, the solid-state drive 602 may further include a persistent memory and a battery backed dynamic random access memory (DRAM) that may provide memory semantics and persistence beyond server power cycle operations. Examples of persistent memory include, but are not limited to Non-Volatile Dual In-line Memory Module (NVDIMM), 3D crosspoint memory, etc.
The machine may be a personal computer (PC), a tablet PC, a set-top box (STB), a Personal Digital Assistant (PDA), a cellular telephone, a web appliance, a server, a network router, a switch or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine. Further, while a single machine is illustrated, the term “machine” shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.
The example computer system 700 includes a processing device 702, a main memory 704 (e.g., read-only memory (ROM), flash memory, dynamic random access memory (DRAM) such as synchronous DRAM (SDRAM), etc.), a static memory 706 (e.g., flash memory, static random access memory (SRAM), etc.), and a data storage device 718, which communicate with each other via a bus 730. The data storage device 718 may correspond to the solid-state drive 120 of
Processing device 702 represents one or more general-purpose processing devices such as a microprocessor, a central processing unit, or the like. More particularly, the processing device may be a complex instruction set computing (CISC) microprocessor, reduced instruction set computing (RISC) microprocessor, very long instruction word (VLIW) microprocessor, or processor implementing other instruction sets, or processors implementing a combination of instruction sets. Processing device 702 may also be one or more special-purpose processing devices such as an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a digital signal processor (DSP), network processor, or the like. The processing device 702 may be configured to execute instructions 726 for performing operations and steps discussed herein.
The computer system 700 may further include a network interface device 708 to communicate over the network 720. The computer system 700 also may include a video display unit 710 (e.g., a liquid crystal display (LCD) or a cathode ray tube (CRT)), an alphanumeric input device 712 (e.g., a keyboard), a cursor control device 714 (e.g., a mouse), a signal generation device 716 (e.g., a speaker), a graphics processing unit 722, a video processing unit 728, and an audio processing unit 732.
The data storage device 718 may include a machine-readable storage medium 724 (also known as a computer-readable medium) on which is stored one or more sets of instructions 726 or software embodying any one or more of the methodologies or functions described herein. The instructions 726 may also reside, completely or at least partially, within the main memory 704 and/or within the processing device 702 during execution thereof by the computer system 700, the main memory 704 and the processing device 702 also constituting machine-readable storage media.
In one implementation, the instructions 726 include instructions to implement functionality corresponding to data deduplication component (e.g., data deduplication component 124 of
Some portions of the preceding detailed descriptions have been presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the ways used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of operations leading to a desired result. The operations are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the above discussion, it is appreciated that throughout the description, discussions utilizing terms such as “identifying” or “determining” or “executing” or “performing” or “collecting” or “creating” or “sending” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage devices.
The present disclosure also relates to an apparatus for performing the operations herein. This apparatus may be specially constructed for the intended purposes, or it may comprise a general purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.
The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct a more specialized apparatus to perform the method. The structure for a variety of these systems will appear as set forth in the description below. In addition, the present disclosure is not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the disclosure as described herein.
The present disclosure may be provided as a computer program product, or software, that may include a machine-readable medium having stored thereon instructions, which may be used to program a computer system (or other electronic devices) to perform a process according to the present disclosure. A machine-readable medium includes any mechanism for storing information in a form readable by a machine (e.g., a computer). For example, a machine-readable (e.g., computer-readable) medium includes a machine (e.g., a computer) readable storage medium such as a read only memory (“ROM”), random access memory (“RAM”), magnetic disk storage media, optical storage media, flash memory devices, etc.
In the foregoing disclosure, implementations of the disclosure have been described with reference to specific example implementations thereof. It will be evident that various modifications may be made thereto without departing from the broader spirit and scope of implementations of the disclosure as set forth in the following claims. The disclosure and drawings are, accordingly, to be regarded in an illustrative sense rather than a restrictive sense.
The following examples pertain to further embodiments.
Example 1 is a system comprising an interface operatively coupled to a storage resource and a processing device that is coupled with the storage resource via the interface to receive a data object associated with a request to store the data object at the storage resource, identify a type of workload associated with the data object that is associated with the request to store the data object at the storage resource, determine a size of a data block of the data object based on the identified type of workload, and perform a deduplication operation for the data object based on the determined size of the data block.
In Example 2, in the system of Example 1, to perform the deduplication operation for the data object based on the determined size of the data block, the processing device is further to identify a plurality of data blocks from the data object, wherein each of the plurality of data blocks is of the determined size, and compare the data blocks of the determined size to previously stored data blocks stored at the storage resource.
In Example 3, in the system of any of Examples 1-2, to compare the data blocks of the determined size to the previously stored data blocks stored at the storage resource, the processing device is further to generate a hash value for a particular data block of the determined size, retrieve a plurality of previous hash values for the previously stored data blocks, and determine whether the generated hash value matches with any of the previous hash values for the previously stored data blocks.
In Example 4, in the system of any of Examples 1-3, the type of workload is associated with a particular application that has generated or used the data object.
In Example 5, in the system of any of Examples 1-4, to identify the type of workload associated with the data object, the processing device is further to identify content of the data object and identify a workload signature from the content of the data object, wherein the type of workload is identified based on the identified workload signature from the content of the data object.
In Example 6, in the system of any of Examples 1-5, to identify the type of workload associated with the data object, the processing device is further to identify a file name of the data object, wherein the identifying of the type of the workload is based on a file extension of the file name of the data object.
In Example 7, in the system of any of Examples 1-6, to identify the type of workload associated with the data object, the processing device is further to receive an application hint associated with the data object, wherein the application hint corresponds to information of an application that has provided the data object, and wherein the identifying of the type of the workload is based on the application hint.
In Example 8, in the system of any of Examples 1-7, to perform the deduplication operation for the data object based on the determined size of the data block, the processing device is further to perform the deduplication for each portion of the data object corresponding to a particular data block of the determined size.
In Example 9, in the system of any of Examples 1-8, to identify the type of the workload associated with the data object, the processing device is further to identify a pattern of usage or size of the data object, wherein the identifying of the type of the workload is based on the pattern of usage or the size of the data object.
In Example 10, in the system of any of Examples 1-9, the processing device is further to determine that the data object is not encrypted, wherein the performing of the deduplication operation is further based on the data object not being encrypted.
Example 11 is an apparatus comprising a processing device, operatively coupled with a storage device, to receive a request to store a file at the storage device, identify an application associated with the file that is from the request to store the file at the storage device, determine a size of a data block of the file based on the identified application, identify a plurality of data blocks from the file, wherein each of the plurality of data blocks from the file is of the determined size, perform a deduplication operation for each of the plurality of data blocks of the determined size from the file, and store at least a portion of the plurality of data blocks from the file at the storage device based on the deduplication operation.
In Example 12, in the apparatus of Example 11, to perform the deduplication operation, the processing device is further to generate a hash value for a particular data block of the plurality of data blocks of the determined size, retrieve a plurality of previous hash values for data blocks previously stored at the storage device, and determine whether the generated hash value matches with any of the previous hash values for the previously stored data blocks.
In Example 13, in the apparatus of any of Examples 11-12, the application corresponds to a particular application that has generated or used the file.
In Example 14, in the apparatus of any of Examples 11-13, to identify the application associated with the file, the processing device is further to identify content of the file, and identify a workload signature from the content of the file, wherein the application is identified based on the identified workload signature from the content of the file.
In Example 15, in the apparatus of any of Examples 11-14, to identify the application associated with the file, the processing device is further to identify a name of the file, wherein the identifying of the application is based on a file extension of the name of the file.
In Example 16, in the apparatus of any of Examples 11-15, to identify the application associated with the file, the processing device is further to receive an application hint associated with the file, wherein the application hint corresponds to information from the request, and wherein the identifying of the application is based on the application hint.
Example 17 is a method comprising receiving a data object associated with a request to store the data object at a storage resource, identifying a type of workload associated with the data object that is associated with the request to store the data object at the storage resource, determining, by a processing device, a size of a data block of the data object based on the identified type of workload, and performing a deduplication operation for the data object based on the determined size of the data block.
In Example 18, in the method of Example 17, performing the deduplication operation for the data object based on the determined size of the data block comprises identifying a plurality of data blocks from the data object, wherein each of the plurality of data blocks is of the determined size, and comparing the data blocks of the determined size to previously stored data blocks stored at the storage resource.
In Example 19, in the method of any of Examples 17-18, comparing the data blocks of the determined size to the previously stored data blocks stored at the storage resource comprises generating a hash value for a particular data block of the determined size, retrieving a plurality of previous hash values for the previously stored data blocks, and determining whether the generated hash value matches with any of the previous hash values for the previously stored data blocks.
In Example 20, in the method of any of Examples 17-19, the type of workload is associated with a particular application that has generated or used the data object.
In Example 21, in the method of any of Examples 17-20, identifying the type of workload associated with the data object comprises identifying content of the data object, and identifying a workload signature from the content of the data object, wherein the type of workload is identified based on the identified workload signature from the content of the data object.
In Example 22, in the method of any of Examples 17-21, identifying the type of workload associated with the data object comprises identifying a file name of the data object, wherein the identifying of the type of the workload is based on a file extension of the file name of the data object.
In Example 23, in the method of any of Examples 17-22, performing the deduplication operation for the data object based on the determined size of the data block further comprises performing the deduplication for each portion of the data object corresponding to a particular data block of the determined size.
In Example 24, in the method of any of Examples 17-23, identifying the type of workload associated with the data object comprises receiving an application hint associated with the data object, wherein the application hint corresponds to information of an application that has provided the data object, and wherein the identifying of the type of the workload is based on the application hint.
Example 25 is a system on a chip (SOC) comprising a plurality of functional units and a data deduplication component, coupled to the functional units, to receive a data object associated with a request to store the data object at the storage resource, identify a type of workload associated with the data object that is associated with the request to store the data object at the storage resource, determine a size of a data block of the data object based on the identified type of workload, and perform a deduplication operation for the data object based on the determined size of the data block.
In Example 26, the SOC of Example 25 further comprises the subject matter of Examples 2-10.
In Example 27, in the SOC of any of Examples 25-26, the data deduplication component is further operable to perform the subject matter of Examples 17-24.
In Example 28, in the SOC of any of Examples 25-27, the SOC further comprises the subject matter of Examples 11-16.
Example 29 is an apparatus comprising means for receiving a data object associated with a request to store the data object at the storage resource, means for identifying a type of workload associated with the data object that is associated with the request to store the data object at the storage resource, means for determining a size of a data block of the data object based on the identified type of workload, and means for performing a deduplication operation for the data object based on the determined size of the data block.
In Example 30, in the apparatus of Example 29, the apparatus further comprises the subject matter of any of Examples 1-10 and 11-16.
Example 31 is an apparatus comprising a memory and a processor coupled to the memory and comprising a data deduplication component, wherein the data deduplication component is configured to perform the method of any of Examples 17-24.
In Example 32, in the apparatus of Example 31, the apparatus further comprises the subject matter of any of Examples 1-10 and 11-16.
Example 33 is a non-transitory machine-readable storage medium including instructions that, when accessed by a processing device, cause the processing device to perform operations comprising receiving a data object associated with a request to store the data object at the storage resource, identifying a type of workload associated with the data object that is associated with the request to store the data object at the storage resource, determining a size of a data block of the data object based on the identified type of workload, and performing a deduplication operation for the data object based on the determined size of the data block.
In Example 34, in the non-transitory machine-readable storage medium of Example 33, the operations further comprise the subject matter of any of Examples 17-24.
While the present disclosure has been described with respect to a limited number of embodiments, those skilled in the art will appreciate numerous modifications and variations therefrom. It is intended that the appended claims cover all such modifications and variations as fall within the true spirit and scope of this present disclosure.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/CN2017/091022 | 6/30/2017 | WO | 00 |