This application claims priority to foreign patent application no. GB 0912846.3, filed 24 Jul. 2009. This application is hereby incorporated by reference as though fully set forth herein.
In storage technology, deduplication is a process in which data is analysed to identify duplicate portions in the data. One of the identified portions can then be stored using a small footprint data identifier, such as a hash, with a locator for the stored duplicate data, instead of duplicating the identified portion in data storage. In this manner, with certain types of data, it is possible to increase the amount of data stored using a given storage capacity.
In order that the invention may be well understood, by way of example only, various embodiments thereof will now be described with reference to the accompanying drawings, in which:
FIGS. 3a to 3c illustrate stages in the processing of portions of a data stream;
Referring to
The data deduplication apparatus 2013 also includes secondary storage 2040. The secondary storage 2040 may provide slower access speeds than the memory 2030, and conveniently comprises hard disk drives, or any other convenient form of mass storage. The hardware of the exemplary data deduplication apparatus 2013 can, for example, be based on an industry-standard server. The secondary storage 2040 can be located in an enclosure together with the data processing apparatus 2020, 2030, or separately.
A link can be formed between the communications interface 2050 and a host communications interface 2080 over the network 2015, for example comprising a Gigabit Ethernet LAN or any other suitable technology. The communications interface 2050 can comprise, for example, a host bus adapter (HBA) using iSCSI over Ethernet or Fibre Channel protocols for handling backup data in a tape data storage format, a NIC using NFS or CIFS network file system protocols for handling backup data in a NAS file system data storage format, or any other convenient type of interface.
The program instructions 2031 also include modules that, when executed by the processor 2020, respectively provide at least one storage collection interface, in the form, for example, of a virtual tape library (VTL) interface 2033 and/or NAS interface (not shown), and a data deduplication engine 2035, as described in further detail below.
The virtual tape library (VTL) interface 2033 in the example emulates at least one physical tape library, so that existing storage applications designed to interact with physical tape libraries can communicate with the interface 2033 without significant adaptation, and so that personnel managing host data backups can maintain current procedures after a physical tape library is replaced by a VTL. A communications path can be established between a storage application and the VTL interface 2033 using the interfaces 2050, 2080 and the network 2015. A part 2090 of the communications path between the VTL interface 2033 and the network 2015 is illustrated in
The VTL interface 2033 can receive a stream of data 3100 as shown in
Referring to
The NAS interface, if provided, presents a file system to the host storage application. A NAS backup file can, for example, comprise a relatively large backup session file provided as a data stream by a backup application 2085. Meta data relating to a typical NAS backup session file may be integrated in the backup session file or provided in one or more separate files. In some embodiments, the command meta data is not stripped from the data stream.
The stripped data stream 3200 (
The storage collection interface also comprises an encoded entity handler 2061. The encoded entity handler 2061 is operable to examine the stripped data stream 3200 and identify in the data stream 3200 meta data associated with an encoded data entity, the meta data relating to an encoding process that has been used to encode the data entity. For example, the encoded entity handler 2061 is provided with compression scheme recognition data that is associated with predetermined data compression schemes, enabling the encoded entity handler 2061 to recognise from header meta data 3220, 3221, 3222 a data compression scheme that has been applied to a respective compressed data entity 3215, 3216, 3217 disposed immediately subsequent to the header meta data in the data stream 3200. The compression scheme recognition data can relate to any desired data compression scheme.
In one example, the encoded entity handler 2061 includes compression scheme recognition data to identify files that have been encoded using a ZIP file format, the format specification for which is readily available. An example is the ZIP file format specification version 6.3.2 published by PKWARE Inc. The structure of such a ZIP file, containing multiple files, file 1 banana.txt and file 2 apple.txt, that have been compressed into the ZIP file, takes the form:
The [local file header 1] is structured as follows:
local file header signature 4 bytes (0x04034b50)
version needed to extract 2 bytes
general purpose bit flag 2 bytes
compression method 2 bytes
last mod file time 2 bytes
last mod file date 2 bytes
crc-32 4 bytes
compressed size 4 bytes
uncompressed size 4 bytes
file name length 2 bytes
extra field length 2 bytes
In this example, the compression scheme recognition data includes at least the four byte value 0x04034b50 representing a ZIP local file header signature. The encoded entity handler 2061 examines the sequence of bytes in the data stream 3200 and, if it encounters an apparent ZIP local file header signature, identifies the immediately following meta data as encoded data entity meta data. The encoded entity handler 2061 can also be operable to perform additional checks for expected value ranges in other expected fields in the identified ZIP local file header to prevent misdetection.
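By way of illustration only, the signature scan and the additional field checks described above might be sketched in Python as follows. The names `find_local_headers` and `KNOWN_METHODS`, and the particular plausibility checks chosen, are assumptions made for this sketch and are not taken from the described embodiment.

```python
import struct

ZIP_LOCAL_SIG = b"PK\x03\x04"  # 0x04034b50 as it appears little-endian in the stream
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")  # fixed 30-byte part of the header
KNOWN_METHODS = {0, 8}  # assumption: accept only "stored" (0) and "deflate" (8)


def find_local_headers(stream: bytes):
    """Yield (offset, fields) for each plausible ZIP local file header."""
    pos = stream.find(ZIP_LOCAL_SIG)
    while pos != -1:
        if pos + LOCAL_HEADER.size <= len(stream):
            (_sig, version, _flags, method, _mtime, _mdate, crc,
             csize, usize, name_len, extra_len) = LOCAL_HEADER.unpack_from(stream, pos)
            data_start = pos + LOCAL_HEADER.size + name_len + extra_len
            # Hypothetical plausibility checks to reduce misdetection:
            # a known compression method, and file data that fits in the stream.
            if method in KNOWN_METHODS and data_start + csize <= len(stream):
                yield pos, {
                    "version_needed": version,
                    "compression_method": method,
                    "crc32": crc,
                    "compressed_size": csize,
                    "uncompressed_size": usize,
                    "data_start": data_start,
                }
        pos = stream.find(ZIP_LOCAL_SIG, pos + 1)
```

A four-byte value can occur by chance in arbitrary backup data, which is why the sketch rejects candidates whose remaining fields are implausible rather than trusting the signature alone.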
In response to confirmed identification of a ZIP encoded data entity, the identified ZIP file header meta data is used to decode the encoded data entity by decompressing the file data according to information contained in the respective ZIP file headers for each compressed file. For example, the [file header 1] in the [central directory] of the exemplary ZIP file can have the following structure:
The encoded entity handler 2061 is operable to use, for example, the data in at least the [file header 1] fields “compression method”, “version needed to extract”, and “version made by” to decompress the [file data 1] encoded data. Other files, such as [file data 2], in the compressed data entity are also decompressed accordingly. The resulting data stream 3300 is shown in
The decompressed file size can be compared to the expected uncompressed file size as specified in the headers as an additional check for correct ZIP file identification. Meta data contained in the [local file header], [file header] and [end of central directory record] structures is stored as encoded entity meta data 2066 in the meta data store 2067. The data stream is processed in an in-line manner. The compressed and non-compressed data contained in the records is not stored to relatively slow secondary storage such as the storage 2040 prior to deduplication.
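As a minimal sketch of the decoding step and the size comparison described above, assuming only the "stored" and "deflate" compression methods are handled, the file data of one compressed file might be decoded in Python as follows. The function name `decode_entity` is hypothetical and chosen for this illustration.

```python
import zlib


def decode_entity(data: bytes, method: int, expected_size: int) -> bytes:
    """Decode one file's data using the header's compression method field."""
    if method == 0:
        # Method 0 ("stored"): the file data is not compressed.
        decoded = data
    elif method == 8:
        # Method 8 ("deflate"): ZIP holds a raw deflate stream, hence wbits=-15.
        decoded = zlib.decompress(data, wbits=-15)
    else:
        raise ValueError(f"unsupported compression method {method}")
    # Additional check: the decoded length must equal the uncompressed size
    # recorded in the header, otherwise the identification is suspect.
    if len(decoded) != expected_size:
        raise ValueError("decoded size does not match header")
    return decoded
```

A size mismatch is raised as an error here so that a caller could fall back to treating the bytes as ordinary, unencoded stream data. That fallback policy is an assumption of this sketch.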
Although the command meta data 2065 and the encoded entity meta data 2066 are shown in one meta data store 2067, separate meta data stores could be provided. The meta data stores can be structured in any convenient manner, for example using a file system or database. Program instructions (not shown) for generating and operating the or each data store can conveniently be stored in the memory 2030.
As shown in
The hasher 4011 is operable to process a data chunk 4018 using a hash function that returns a number, or hash, that can be used as a chunk identifier 4019 to identify the chunk 4018. The chunk identifiers 4019 are stored in manifests 4022 in a manifest store 4020 in secondary storage 2040. Each manifest 4022 comprises a plurality of chunk identifiers 4019. The chunk identifiers 4019 are represented in
The matcher 4012 is operable to attempt to establish whether a data chunk 4018 in a newly arrived segment 4015 is identical to a previously processed and stored data chunk. This can be done in any convenient manner. If no match is found for a data chunk 4018 of a segment 4015, the storer 4013 will store the corresponding unmatched data chunk 4018 from the buffer 4030 to a deduplicated data store 4021 in secondary storage 2040, as shown by the unbroken arrows in
Because the compressed entities are presented to the deduplication engine 2035 in decoded form, there can be a significantly increased probability of obtaining a larger number of matching data chunks 4018 during the matching process in many data storage situations, for example multiple sequential data backup sessions. For example, as shown in
Data chunks 4018 are conveniently stored in the deduplicated data store in relatively large containers 4023, having a size of, for example, between 2 and 4 Mbytes, or any other convenient size. Data chunks 4018 can be processed to compress the data if desired prior to saving to the deduplicated data store 4021, for example using LZO or any other convenient compression algorithm. It will be appreciated that the skilled person will be able to envisage many alternative ways in which to store and match the chunk identifiers and data chunks. If the cost of an increase in size of fast access memory is not a practical impediment, at least part of the manifest store and/or the deduplicated data store could be retained in fast access memory.
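The roles of the hasher, matcher and storer described above can be illustrated with the following simplified Python sketch. It assumes fixed-size chunks, SHA-256 chunk identifiers and in-memory dictionaries in place of the manifest store and containers. None of these choices is specified by the embodiment, and the names `DedupStore` and `CHUNK_SIZE` are hypothetical.

```python
import hashlib

CHUNK_SIZE = 4096  # assumption: fixed-size chunking, chosen only for brevity


class DedupStore:
    def __init__(self):
        self.chunks = {}     # chunk identifier -> chunk bytes (deduplicated data)
        self.manifests = []  # one ordered list of chunk identifiers per segment

    def store_segment(self, segment: bytes):
        """Store a segment; return (manifest index, number of new chunks)."""
        manifest, new_chunks = [], 0
        for i in range(0, len(segment), CHUNK_SIZE):
            chunk = segment[i:i + CHUNK_SIZE]
            chunk_id = hashlib.sha256(chunk).hexdigest()  # hasher
            if chunk_id not in self.chunks:               # matcher: no match found
                self.chunks[chunk_id] = chunk             # storer: keep new chunk
                new_chunks += 1
            manifest.append(chunk_id)
        self.manifests.append(manifest)
        return len(self.manifests) - 1, new_chunks

    def restore(self, index: int) -> bytes:
        """Reassemble a segment from its manifest."""
        return b"".join(self.chunks[cid] for cid in self.manifests[index])
```

In this sketch a second segment that differs from the first in only one chunk adds a single new entry to `chunks`, while its manifest still records every chunk identifier in order so that the segment can be reassembled.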
As shown in
In response to the command handler 2060 receiving a read request, the deduplication engine 2035 is instructed by the storage collection interface 2033 to reassemble the requested data, which will resemble a portion of the decompressed data stream 3300. The encoded entity handler 2061 accesses the relevant encoded entity meta data 2066 from the meta data store 2067, and where appropriate assembles the resulting data into compressed entities with associated compressed entity headers, resulting in a data stream structured similarly to the data stream 3200 of
At least some of the embodiments described above provide a greater opportunity for the data deduplication engine to match data entities, or portions of data entities, which in the unencoded condition thereof have many identical chunks, but which lose that identity when even slightly changed and encoded as part of a storage data stream, for example a backup data stream. This facilitates, at least when used with certain types of data, a decrease in the volume of data required to be stored and a consequential increase in the amount of data that can be stored using a defined storage capacity.
There may be some residual level of duplication of data chunks in the deduplicated data store 4021, and the terms deduplication and deduplicated should be understood in this context. In alternative embodiments, other techniques of deduplication can be employed than as described above.
While various embodiments have been described above with reference to data entities encoded using data compression schemes, the invention also has application to data entities encoded using other types of data encoding schemes, for example data encryption schemes. In the example of data encryption schemes, an appropriate key management arrangement is necessary, for example to securely provide appropriate encryption and/or decryption keys to the data deduplication apparatus.
| Number | Date | Country | Kind |
|---|---|---|---|
| 0912846.3 | Jul 2009 | GB | national |