This technology generally relates to data storage management and, more particularly, to methods for performing data deduplication on data blocks and devices thereof.
Storage drives or disks provide an easy, fast, and convenient way for backing up or storing data. As additional backups are made, additional disks and disk space are required. However, disks or storage drives add costs to any backup solution including the costs of the disks themselves, costs associated with powering and cooling the disks, and costs associated with physically storing the disks in the datacenter. Thus, it becomes desirable to maximize the usage of disk storage available on each disk.
One method of maximizing storage on a disk is to use some form of data deduplication. Data deduplication is a data compression technique for eliminating redundant data. In an existing deduplication process, first data is compared to stored data to detect duplicates, that is, to determine whether the first data is unique. When the first data is identified as not being unique, the redundant first data is eliminated and replaced with a small reference that points to the stored data. However, existing technologies only perform data deduplication by comparing the data present in one data block with the data present in another data block. They fail to deduplicate repeated data within a single data block.
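The existing block-to-block process described above can be sketched as follows. This is a minimal illustration only, not taken from any particular product: the `BlockStore` class, its method names, and the choice of SHA-256 as the comparison digest are all assumptions made for the example.

```python
import hashlib


class BlockStore:
    """Hypothetical store that deduplicates whole blocks against each other."""

    def __init__(self):
        self._blocks = {}  # digest -> block bytes, one copy per unique block

    def write(self, block: bytes) -> str:
        # Compare the incoming block to stored data by digest (assumed: SHA-256).
        digest = hashlib.sha256(block).hexdigest()
        # A duplicate block is not stored again; the caller keeps only the
        # small reference (the digest) that points to the stored copy.
        self._blocks.setdefault(digest, block)
        return digest

    def read(self, reference: str) -> bytes:
        return self._blocks[reference]

    def stored_bytes(self) -> int:
        return sum(len(block) for block in self._blocks.values())


store = BlockStore()
first = store.write(b"x" * 4096)
second = store.write(b"x" * 4096)  # duplicate of the first block
```

In this sketch the second write adds no data, yet the single stored block still occupies all 4096 bytes even though it consists of one repeating pattern, which is the within-block redundancy that block-to-block comparison leaves untouched.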
An environment 10 with a plurality of client computing devices 12(1)-12(n), an exemplary storage management computing device 14, and a plurality of storage drives 16(1)-16(n) is illustrated in
Referring to
The processor 18 of the storage management computing device 14 may execute one or more programmed instructions stored in the memory 20 for performing data deduplication on data blocks as illustrated and described in the examples herein, although other types and numbers of functions and/or other operations can be performed. The processor 18 of the storage management computing device 14 may include one or more central processing units (“CPUs”) or general purpose processors with one or more processing cores, such as AMD® processor(s), although other types of processor(s) could be used (e.g., Intel®).
The memory 20 of the storage management computing device 14 stores the programmed instructions and other data for one or more aspects of the present technology as described and illustrated herein, although some or all of the programmed instructions could be stored and executed elsewhere. A variety of different types of memory storage devices, such as a non-volatile memory, random access memory (RAM) or a read only memory (ROM) in the system or a floppy disk, hard disk, CD ROM, DVD ROM, or other computer readable medium which is read from and written to by a magnetic, optical, or other reading and writing system that is coupled to the processor 18, can be used for the memory 20.
The communication interface 24 of the storage management computing device 14 operatively couples and communicates with the plurality of client computing devices 12(1)-12(n) and the plurality of storage drives 16(1)-16(n), which are all coupled together by the communication network 30, although other types and numbers of communication networks or systems with other types and numbers of connections and configurations to other devices and elements can be used. By way of example only, the communication network 30 can use TCP/IP over Ethernet and industry-standard protocols, including NFS, CIFS, SOAP, XML, LDAP, and SNMP, although other types and numbers of communication networks can be used. The communication network 30 in this example may employ any suitable interface mechanisms and network communication technologies, including, for example, any local area network, any wide area network (e.g., Internet), teletraffic in any suitable form (e.g., voice, modem, and the like), Public Switched Telephone Networks (PSTNs), Ethernet-based Packet Data Networks (PDNs), and any combinations thereof and the like. In this example, the bus 26 is a universal serial bus, although other bus types and links may be used, such as PCI-Express or HyperTransport bus.
Each of the plurality of client computing devices 12(1)-12(n) includes a central processing unit (CPU) or processor, a memory, and an I/O system, which are coupled together by a bus or other link, although other numbers and types of network devices could be used. The plurality of client computing devices 12(1)-12(n) communicates with the storage management computing device 14 for storage management, although the client computing devices 12(1)-12(n) can interact with the storage management computing device 14 for other purposes. By way of example, the plurality of client computing devices 12(1)-12(n) may run application(s) that may provide an interface to make requests to access, modify, delete, edit, read or write data within storage management computing device 14 or the plurality of storage drives 16(1)-16(n) via the communication network 30.
Each of the plurality of storage drives 16(1)-16(n) includes a central processing unit (CPU) or processor, and an I/O system, which are coupled together by a bus or other link, although other numbers and types of network devices could be used. Each of the plurality of storage drives 16(1)-16(n) assists with storing data, although the plurality of storage drives 16(1)-16(n) can also assist with other types of operations, such as storing files. Various network processing applications, such as CIFS applications, NFS applications, HTTP Web Data storage device applications, and/or FTP applications, may be operating on the plurality of storage drives 16(1)-16(n) and transmitting data (e.g., files or web pages) in response to requests from the storage management computing device 14 and the plurality of client computing devices 12(1)-12(n). It is to be understood that the plurality of storage drives 16(1)-16(n) may be hardware or software or may represent a system with multiple external resource servers, which may include internal or external networks.
Although the exemplary network environment 10 includes the plurality of client computing devices 12(1)-12(n), the storage management computing device 14, and the plurality of storage drives 16(1)-16(n) described and illustrated herein, other types and numbers of systems, devices, components, and/or other elements in other topologies can be used. It is to be understood that the systems of the examples described herein are for exemplary purposes, as many variations of the specific hardware and software used to implement the examples are possible, as will be appreciated by those of ordinary skill in the art.
In addition, two or more computing systems or devices can be substituted for any one of the systems or devices in any example. Accordingly, principles and advantages of distributed processing, such as redundancy and replication, also can be implemented, as desired, to increase the robustness and performance of the devices and systems of the examples. The examples may also be implemented on computer system(s) that extend across any suitable network using any suitable interface mechanisms and traffic technologies, including by way of example only teletraffic in any suitable form (e.g., voice and modem), wireless traffic media, wireless traffic networks, cellular traffic networks, 3G traffic networks, Public Switched Telephone Networks (PSTNs), Packet Data Networks (PDNs), the Internet, intranets, and combinations thereof.
The examples also may be embodied as a non-transitory computer readable medium having instructions stored thereon for one or more aspects of the present technology as described and illustrated by way of the examples herein, which when executed by the processor, cause the processor to carry out the steps necessary to implement the methods of this technology.
An example of a method for performing data deduplication on data blocks will now be described herein with reference to
Next in step 310, the storage management computing device 14 splits each data block into segments of a granular size (segment size). By way of example, the granular size can be 512 bytes, 256 bytes, or 128 bytes, although the data block can be split into segments of other sizes. In this example,
Next in step 315, the storage management computing device 14 determines a checksum for the data in each segment within each of the data blocks. In this example with reference to
In step 320, the storage management computing device 14 compares the determined checksums of the segments of each data block to identify duplicate segments of data within that data block. In this example, two segments having the same checksum value are determined to be duplicate data within the same data block. By way of example with reference to
Next in step 325, the storage management computing device 14 creates a unique signature for each of the segments determined to have equal checksums, for each of the data blocks that was received. By way of example with reference to
Next in step 330, the storage management computing device 14 stores the created signature in the header field of the data, although the storage management computing device 14 can store the created signature at other locations. When there is a request to either read or write the data block that was sent from one of the plurality of client computing devices 12(1)-12(n), the storage management computing device 14 can extract the signature that is stored in the header to reconstruct the full data block.
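Steps 310 through 330 can be sketched as follows. This is a minimal sketch under stated assumptions rather than the disclosed implementation: the function names, the use of CRC-32 as the checksum, and the header layout (a `signature` list that maps each original segment position to a slot in the stored payload) are hypothetical choices made for illustration. The sketch also confirms a checksum match with a byte comparison, a safeguard added here so that a checksum collision cannot corrupt the reconstructed block.

```python
import zlib

SEGMENT_SIZE = 512  # bytes; the text also mentions 256 and 128


def split_block(block: bytes, size: int = SEGMENT_SIZE) -> list:
    """Step 310: split a data block into fixed-size segments."""
    return [block[i:i + size] for i in range(0, len(block), size)]


def deduplicate_block(block: bytes, size: int = SEGMENT_SIZE):
    """Steps 315-330: checksum segments, find repeats, build a header.

    Returns (header, payload), where payload holds each distinct segment once.
    """
    unique = []       # distinct segments, in order of first appearance
    by_checksum = {}  # checksum -> slot of the first segment with that checksum
    signature = []    # hypothetical signature: payload slot per original segment
    for segment in split_block(block, size):
        checksum = zlib.crc32(segment)  # step 315 (CRC-32 is an assumption)
        slot = by_checksum.get(checksum)
        # Step 320: equal checksums mark a duplicate; the byte comparison is
        # an added safeguard against checksum collisions.
        if slot is not None and unique[slot] == segment:
            signature.append(slot)
            continue
        if slot is None:
            by_checksum[checksum] = len(unique)
        signature.append(len(unique))
        unique.append(segment)
    # Steps 325-330: the signature is kept in a header alongside the data.
    header = {"segment_size": size, "signature": signature}
    return header, b"".join(unique)


def reconstruct_block(header: dict, payload: bytes) -> bytes:
    """Rebuild the full data block from the header signature and payload."""
    unique = split_block(payload, header["segment_size"])
    return b"".join(unique[slot] for slot in header["signature"])


block = b"A" * 512 + b"B" * 512 + b"A" * 512 + b"C" * 100
header, payload = deduplicate_block(block)
restored = reconstruct_block(header, payload)
```

For the sample block, the third segment repeats the first, so only three distinct segments are kept in the payload and the signature records where each one belongs when the block is read back.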
In step 335, the storage management computing device 14 performs data compaction on all four data blocks for which the signature was created. The technique of data compaction is described in U.S. Publication No. 2017/0031614A1, which is hereby incorporated by reference in its entirety. By way of example, the result of data compaction of the four data blocks with signature is illustrated in
Next in step 340, the storage management computing device 14 stores the data blocks in the data compacted form in the plurality of storage drives 16(1)-16(n) as illustrated in
However, if back in step 320 the storage management computing device 14 determines that the segments of a data block do not have the same checksum, then the No branch is taken to step 345. In this example, when the checksums of the segments within the same data block do not match, the data in the segments of that block is not duplicate or repetitive data.
In step 345, the storage management computing device 14 stores the data blocks in the plurality of storage drives 16(1)-16(n) in the format in which they were received, although the storage management computing device 14 can store the received blocks of data in other formats and other memory locations. The exemplary flow of the method then proceeds back to step 305 where the storage management computing device 14 receives the next data blocks from the plurality of client computing devices 12(1)-12(n).
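The branch taken at step 320 can be sketched as a simple routing decision. The function names, the returned labels, and the use of CRC-32 are assumptions made for illustration; the byte comparison after a checksum match is likewise an added safeguard rather than part of the described method.

```python
import zlib


def has_repeated_segment(block: bytes, size: int = 512) -> bool:
    """Step 320: report whether any two segments of one block are duplicates."""
    seen = {}  # checksum -> segments already observed with that checksum
    for start in range(0, len(block), size):
        segment = block[start:start + size]
        checksum = zlib.crc32(segment)  # assumed checksum function
        if segment in seen.get(checksum, []):
            return True
        seen.setdefault(checksum, []).append(segment)
    return False


def route_block(block: bytes, size: int = 512) -> str:
    """Choose the Yes branch (steps 325-340) or the No branch (step 345)."""
    if has_repeated_segment(block, size):
        return "deduplicate"        # signature, compaction, then store
    return "store_as_received"      # step 345


repeated = route_block(b"A" * 1024)
distinct = route_block(b"A" * 512 + b"B" * 512)
```

A block made of two identical 512-byte segments is routed to the deduplication path, while a block whose segments all differ is stored in the format in which it was received.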
Accordingly, as illustrated and described by way of the examples herein, this technology provides a number of advantages including providing methods, non-transitory computer readable media and devices for performing deduplication on data blocks. Using the examples illustrated above, the disclosed technology is able to significantly reduce the storage space that the data blocks occupy in the storage drives, thereby managing the storage space in a more efficient manner. Additionally, the disclosed technology can perform deduplication at segment granularity even in cases where a full filesystem block is not filled with the same pattern.
Having thus described the basic concept of the technology, it will be rather apparent to those skilled in the art that the foregoing detailed disclosure is intended to be presented by way of example only, and is not limiting. Various alterations, improvements, and modifications will occur to those skilled in the art and are intended, though not expressly stated herein. These alterations, improvements, and modifications are intended to be suggested hereby, and are within the spirit and scope of the technology. Additionally, the recited order of processing elements or sequences, or the use of numbers, letters, or other designations therefor, is not intended to limit the claimed processes to any order except as may be specified in the claims. Accordingly, the invention is limited only by the following claims and equivalents thereto.