The amount and complexity of electronic data that consumers and companies generate and use continue to grow, as do the size and complexity of the related applications. In response, the data centers housing this growing body of data and these applications have begun to implement a variety of networking and server configurations to provide storage of and access to the data.
The following detailed description references the drawings, in which:
As users generate and consume greater amounts of data, the storage demands for these data also increase. Larger volumes of data become increasingly expensive, time-consuming, and space-consuming to store and access. Moreover, duplicate data, that is, data that is identical to previously stored data, is common, and such duplicate data further taxes storage resources.
Data deduplication (i.e., detecting duplicate data) in primary storage arrays is increasingly useful with the addition of solid state disks (SSDs) to the supported media in these arrays. The cost differential between SSDs and traditional hard disk drives motivates solutions such as deduplication and compression to reduce the cost per byte of these storage arrays. At the same time, primary storage arrays must meet the high performance demands placed on them by host operating systems in terms of low latency and high throughput.
With storage capacities growing increasingly larger, finding duplicate data is a scaling problem that places demands on the central processing unit (CPU) and memory of the storage controllers. The impact of deduplication on input/output performance is determined by various parameters, such as whether data is deduplicated inline or in the background, as well as the granularity of deduplication. Deduplicating data at a smaller granularity (such as 16 KB pages), while providing better space savings, requires more CPU processing and memory. Some primary storage arrays are not able to reconcile the conflicting demands of input/output performance and inline data deduplication, and consequently resort to background deduplication. Some arrays also address deduplication by deduplicating data in larger chunks (multiple gigabytes). Other approaches detect duplicate data using cryptographic hashes, which take more space to store and more processing resources to compare.
Deduplication in the computing environment can be performed at many layers, including the server, storage and backup solutions. However, many of the existing solutions are CPU and memory intensive, and do not employ hardware offload engines.
Various implementations are described below by referring to several examples of detecting duplicate data blocks using cyclic redundancy check and three-level table. In one example implementation according to aspects of the present disclosure, a method may include calculating, by a computing system, a cyclic redundancy check (CRC) value for a received data request. The method may further include translating, by the computing system, the CRC value into a physical page location using a three-level table. The method may also include detecting, by the computing system, whether the received data request represents duplicate data by comparing the received data request with a data stored at the physical page location.
In another example implementation according to aspects of the present disclosure, a system may include a processing resource. The system may also include a cyclic redundancy check module to calculate a cyclic redundancy check value of a received data page. Further, the system may include a three-level table module to translate the cyclic redundancy check value into a physical page location of a storage volume. The system may also include a deduplication detection module to determine whether the received data page matches an existing data page in the storage volume by performing an XOR operation and a zero detection operation.
In yet another example, a non-transitory computer-readable storage medium stores instructions that, when executed by a processor, cause the processor to perform the following functions: calculate a cyclic redundancy check (CRC) value for a received data page for a data store; apply the computed CRC value as a page offset into a deduplicate data store; translate the CRC value into a physical page location of the deduplicate data store; and detect duplicate data by determining whether an existing data page at the physical page location matches the received data page.
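For illustration only, the following Python sketch outlines the flow common to these three example implementations. It is a simplification, not the claimed implementation: a plain in-memory dictionary keyed by the CRC value stands in for the deduplicate data store and its three-level table, and the helper and variable names are assumptions made for readability.

```python
import zlib

PAGE_SIZE = 16 * 1024  # example 16 KB page granularity

# A plain dictionary keyed by CRC stands in for the deduplicate data store;
# the implementations described below walk a three-level table instead.
dedup_store = {}

def handle_page(page: bytes) -> str:
    """Illustrative flow: compute CRC, locate a candidate page, compare data."""
    crc = zlib.crc32(page)                 # 1. CRC value of the received page
    entry = dedup_store.get(crc)           # 2. locate any page with the same CRC
    if entry is None:
        dedup_store[crc] = {"data": page, "refcount": 1}
        return "new page stored"
    if entry["data"] == page:              # 3. identical data: duplicate detected
        entry["refcount"] += 1
        return "duplicate detected"
    return "CRC collision: store in the original data store"
```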
In some implementations, the described data duplication detection uses less storage space for detecting duplicate data blocks than conventional cryptographic hashes. For example, by using a cyclic redundancy check (CRC) as a first pass for determining duplicate data, and relying on the low incidence of CRC collisions (i.e., differing data with the same CRC value), the space utilized in storing signatures is greatly reduced. Conventional cryptographic hashes may use, for example, four to five times as much space for storing the hashes as compared to the CRC values. Additionally, the time needed to make the CRC value comparisons is reduced. These and other advantages will be apparent from the description that follows.
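As a rough illustration of the space difference, the snippet below compares per-page signature storage for a 4-byte CRC32 value against a 20-byte cryptographic hash (for example, SHA-1) over an assumed 1 TiB volume deduplicated at 16 KB granularity; the figures are assumptions chosen only to make the ratio concrete.

```python
PAGE_SIZE = 16 * 1024                 # assumed 16 KB deduplication granularity
VOLUME_SIZE = 1024 ** 4               # assumed 1 TiB volume
pages = VOLUME_SIZE // PAGE_SIZE      # 67,108,864 pages

crc_bytes = 4                         # one CRC32 signature per page
hash_bytes = 20                       # one 160-bit cryptographic hash per page

print(pages * crc_bytes // 2**20, "MiB of CRC32 signatures")       # 256 MiB
print(pages * hash_bytes // 2**20, "MiB of cryptographic hashes")  # 1280 MiB
```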
It should be understood that the computing system 100 may include any appropriate type of computing device, including for example smartphones, tablets, desktops, laptops, workstations, servers, smart monitors, smart televisions, digital signage, scientific instruments, retail point of sale devices, video walls, imaging devices, peripherals, or the like.
The computing system 100 may include a processing resource 102 that represents generally any suitable type or form of processing unit or units capable of processing data or interpreting and executing instructions. The instructions may be stored on a non-transitory tangible computer-readable storage medium, such as memory resource 104 of
In examples, as illustrated in
Returning to
The CRC module 110 calculates a cyclic redundancy check value or signature of a received data request in order to aid in locating the data on the physical volume (e.g., the data store 106). For example, when an input/output (I/O) request is received, such as data or a data page, the CRC module 110 calculates a CRC value (or signature) of the incoming data. Once the CRC value (or signature) of the incoming data request is calculated by the CRC module 110, the CRC value is compared to the CRC value of existing data already stored in a storage array (such as data store 106 of
In examples, the CRC module 110 may be a dedicated hardware module or offload engine that computes the CRC of the received data request using, for example, the CRC32 algorithm. In other examples, the dedicated hardware module implementation of the CRC module 110 may compute higher-precision hashes of the data, such as with the SHA-2 algorithm. Consequently, by offloading the traditionally processing-intensive CRC value calculations to a dedicated hardware module, the processing resource (such as processing resource 102) is relieved of performing those calculations.
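Where a dedicated hardware module is not available, the same signatures can be computed in software. The sketch below uses Python's standard zlib and hashlib modules as stand-ins for the offload engine; it illustrates the two signature options mentioned above rather than describing the hardware implementation.

```python
import hashlib
import zlib

def crc32_signature(page: bytes) -> int:
    """Software CRC32 of a received data page (stand-in for a hardware offload engine)."""
    return zlib.crc32(page)

def sha2_signature(page: bytes) -> bytes:
    """Higher-precision alternative using SHA-256, a member of the SHA-2 family."""
    return hashlib.sha256(page).digest()
```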
Once the value or signature of the incoming data is computed by the CRC module 110, the data is checked to see whether the same signature already exists in the volume receiving the data. In examples, this check may also be offloaded to a dedicated hardware module or offload engine. At this point, the three-level table module 112 translates the CRC value into a physical page location or logical block address by performing a three-level table walk. In an example, a hidden thin provisioned volume that is not visible to users, referred to as a deduplicate data store, may be created.
When a page of data is received and the CRC value is computed for that page, the computed CRC is used as the page offset into the thin provisioned deduplicate data store volume. Since the deduplicate data store is a thin provisioned volume, a three-level translation, known as a three-level table walk, may be performed to translate the CRC value into a physical page location.
For the thin provisioned volumes, the data being accessed by the host is located using the three-level table module 112. This translation process is analogous to the way processors translate virtual addresses to physical addresses. The result of translating the host supplied logical block address (LBA) using the three-level page tables is a pointer to a 16 KB page, for example, which contains the requested data. Thus, performing the three-level page table walk to translate a CRC value into a physical location pointer is a part of the I/O path in the operating system.
The three-level table walk results in either a physical page location or a null address, which implies that the offset has not been written. Thus, when the CRC value is used to walk the deduplicate data store, it can be determined by the deduplication detection module 114 whether another page within the deduplicate data store exists with the same CRC value.
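One way to picture the three-level table walk is to split the 32-bit CRC value into three index fields, each selecting an entry in successive tables, with the final entry holding either a physical page location or nothing at all. The field widths (11, 11, and 10 bits) and the in-memory representation in the sketch below are assumptions made for illustration; in a thin provisioned volume these tables would live in the array's metadata.

```python
class ThreeLevelTable:
    """Illustrative three-level translation of a 32-bit CRC to a physical page location."""

    def __init__(self):
        self.level1 = {}  # sparse nested tables standing in for thin provisioned metadata

    @staticmethod
    def _indices(crc: int):
        # Split the 32-bit CRC into assumed 11-, 11-, and 10-bit index fields.
        return (crc >> 21) & 0x7FF, (crc >> 10) & 0x7FF, crc & 0x3FF

    def lookup(self, crc: int):
        """Walk the three levels; return a physical page location, or None if unwritten."""
        i1, i2, i3 = self._indices(crc)
        level2 = self.level1.get(i1)
        if level2 is None:
            return None
        level3 = level2.get(i2)
        if level3 is None:
            return None
        return level3.get(i3)

    def map(self, crc: int, physical_page: int):
        """Populate the tables so that the CRC translates to physical_page."""
        i1, i2, i3 = self._indices(crc)
        self.level1.setdefault(i1, {}).setdefault(i2, {})[i3] = physical_page
```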
If another page within the deduplicate data store does not exist, the incoming data request is written to that offset. However, if a page does exist, an “exclusive or” (XOR) operation is performed between the new data page and the existing data page. Then, the three-level table module 112 performs a zero detection on the result of the XOR to determine whether the two data pages with the same signature are identical or different. If they are identical, the reference count on the page of data in the deduplicate data store is incremented. However, if they are not identical, a CRC collision is said to occur, and the page is stored in the data store 106 to which the original input/output data request was directed. In this way, it can be determined whether two pages with identical signatures contain identical data. In examples, the three-level table module 112 may utilize special hardware, such as an application specific integrated circuit (ASIC) or other appropriate discrete hardware component, to perform the XOR operation and/or the zero detection.
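The comparison itself can be sketched as follows: XOR the new page against the existing page and apply zero detection to the result. In the described system this work may be performed by an ASIC or other discrete hardware; the Python below is only a functional sketch of the same logic.

```python
def pages_match(new_page: bytes, existing_page: bytes) -> bool:
    """XOR two pages and apply zero detection to the result.

    An all-zero result means the pages are identical; any non-zero byte
    means the matching CRC values were a collision.
    """
    if len(new_page) != len(existing_page):
        return False
    xor_result = bytes(a ^ b for a, b in zip(new_page, existing_page))
    return not any(xor_result)  # zero detection: True only if every byte is zero
```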
At block 302, the method 300 includes calculating a cyclic redundancy check (CRC) value for received data. For example, the method 300 may include calculating, by a computing system such as computing system 100 of
At block 304, the method 300 includes translating the CRC value into a physical page location using a three-level table. For example, the method 300 may include translating, by the computing system such as computing system 100 of FIG. 1, the CRC value into a physical page location using a three-level table as in
At block 306, the method 300 includes detecting whether the received data represents duplicate data by comparing the received data with data stored at the physical page location. For example, the method 300 may include detecting, by the computing system such as the computing system 100 of
If another page (i.e., an existing data page) within the deduplicate data store does not exist, the incoming data request is written to that offset. However, if a page does exist, an “exclusive or” (XOR) operation is performed between the new data page and the existing data page. Then a zero detection is performed on the result of the XOR to determine whether the two data pages with the same signature are identical or different. If they are identical, the reference count on the page of data in the deduplicate data store is incremented. However, if they are not identical, a CRC collision is said to occur, and the page is stored in the data store to which the original input/output data request was directed. In this way, it can be determined whether two pages with identical signatures contain identical data. In examples, special hardware, such as an application specific integrated circuit (ASIC) or other appropriate discrete hardware component, may be implemented to perform the XOR operation and/or the zero detection.
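Putting the pieces together, a write path corresponding to blocks 302 through 306 might look like the sketch below. It reuses the ThreeLevelTable and pages_match sketches above, and the page-store interface (write_page, read_page, write_original, refcounts) is an assumption made for illustration, not part of this disclosure.

```python
import zlib

def process_write(page: bytes, table: "ThreeLevelTable", store) -> str:
    """Illustrative write path: CRC (302), table walk (304), duplicate detection (306)."""
    crc = zlib.crc32(page)                      # block 302: CRC of the received data
    location = table.lookup(crc)                # block 304: three-level table walk
    if location is None:
        location = store.write_page(page)       # offset not yet written: store the page
        table.map(crc, location)
        store.refcounts[location] = 1
        return "stored new page"
    if pages_match(page, store.read_page(location)):  # block 306: XOR and zero detection
        store.refcounts[location] += 1
        return "duplicate: reference count incremented"
    store.write_original(page)                  # CRC collision: keep in the original data store
    return "collision: stored in original data store"
```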
Additional processes also may be included, and it should be understood that the processes depicted in
At block 402, the method 400 includes calculating a cyclic redundancy check (CRC) value for received data. For example, the method 400 may include calculating a cyclic redundancy check (CRC) value for a received data page for a data store. When an input/output (I/O) request is received, such as a data page, the CRC value is calculated, such as by the CRC module 110 of
At block 404, the method 400 includes applying the computed CRC value as a page offset. For example, the method 400 may include applying the computed CRC value as a page offset into a deduplicate data store. When a page of data is received and the CRC value is computed for that page, the computed CRC is used as the page offset into the thin provisioned deduplicate data store volume. Since the deduplicate data store is a thin provisioned volume, a three-level translation, known as a three-level table walk, may be performed to translate the CRC value into a physical page location. The method continues at block 406.
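Because the CRC value itself serves as the page index, the byte offset into the deduplicate data store follows directly from the page size; the short sketch below assumes the 16 KB page size used elsewhere in this description. With a 32-bit CRC and 16 KB pages, the fully provisioned address space would span 2^32 × 16 KB (64 TiB), which is why the deduplicate data store is thin provisioned and only written offsets consume physical space.

```python
PAGE_SIZE = 16 * 1024  # assumed 16 KB page granularity

def crc_to_store_offset(crc32_value: int) -> int:
    """Byte offset of the page selected by a 32-bit CRC within the deduplicate data store."""
    return crc32_value * PAGE_SIZE
```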
At block 406, the method 400 includes translating the CRC value into a physical page location. For example, the method 400 may include translating the CRC value into a physical page location of the deduplicate data store. The result of translating the host supplied logical block address (LBA) using the three-level page tables is a pointer to a 16 KB page, for example, which contains the requested data. Thus, performing the three-level page table walk to translate a CRC value into a physical location pointer is a part of the I/O path in the operating system.
The three-level table walk results in either a physical page location or a null address, which implies that the offset has not been written. Thus, when the CRC value is used to walk the deduplicate data store, it can be determined whether another page within the deduplicate data store exists with the same CRC value. The three-level table walk may be performed, for example, by a discrete hardware component, such as an application-specific integrated circuit. The method continues at block 408.
At block 408, the method 400 includes detecting duplicate data. For example, the method 400 may include detecting duplicate data by determining whether an existing data page at the physical page location matches the received data page. If another page (i.e., an existing data page) within the deduplicate data store does not exist, the incoming data request is written to that offset. However, if a page does exist, an “exclusive or” (XOR) operation is performed between the new data page and the existing data page. Then a zero detection is performed on the result of the XOR to determine whether the two data pages with the same signature are identical or different. If they are identical, the reference count on the page of data in the deduplicate data store is incremented. However, if they are not identical, a CRC collision is said to occur, and the page is stored in the data store to which the original input/output data request was directed. In this way, it can be determined whether two pages with identical signatures contain identical data. In examples, special hardware, such as an application specific integrated circuit (ASIC) or other appropriate discrete hardware component, may be implemented to perform the XOR operation and/or the zero detection.
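As a usage illustration of the sketches above, and under the same assumptions (here with a trivial in-memory stand-in for the page store), writing the same page twice increments its reference count rather than consuming another physical page:

```python
class InMemoryStore:
    """Trivial in-memory stand-in for the data store and deduplicate data store."""

    def __init__(self):
        self.pages, self.originals, self.refcounts = [], [], {}

    def write_page(self, page):
        self.pages.append(page)
        return len(self.pages) - 1  # physical page location

    def read_page(self, location):
        return self.pages[location]

    def write_original(self, page):
        self.originals.append(page)

table, store = ThreeLevelTable(), InMemoryStore()
page = bytes(16 * 1024)                    # a 16 KB page of zeros
print(process_write(page, table, store))   # stored new page
print(process_write(page, table, store))   # duplicate: reference count incremented
print(store.refcounts)                     # {0: 2}
```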
Additional processes also may be included, and it should be understood that the processes depicted in
It should be emphasized that the above-described examples are merely possible examples of implementations and set forth for a clear understanding of the present disclosure. Many variations and modifications may be made to the above-described examples without departing substantially from the spirit and principles of the present disclosure. Further, the scope of the present disclosure is intended to cover any and all appropriate combinations and sub-combinations of all elements, features, and aspects discussed above. All such appropriate modifications and variations are intended to be included within the scope of the present disclosure, and all possible claims to individual aspects or combinations of elements or steps are intended to be supported by the present disclosure.