The present disclosure relates generally to data deduplication and efficient data storage, and more particularly, to techniques for storing moving pictures.
Moving pictures can require substantial amounts of memory for storage in digital form. In the motion picture industry, for example, when a major motion picture is released, it is not uncommon for a movie studio to produce between 30 and 70 variations of the movie. Some variations can be minor in nature. These variations may include, for example, an original master and a number of modified versions of the original master. The different versions may be created for various reasons. Some versions may include adjusted introduction information or credits to account for different audiences or formats. Other versions may include additional or different scenes (such as for a director's cut and a theatre version, for instance). Still other versions may be the result of editing the movie to make it compatible with a specific rating (e.g., PG-13, R, GP, NC-17, and the like), or with the laws of a specific country in which the film is intended for distribution. Producing such variations may involve editing, changing, and deleting scenes, and/or producing more than one version of a scene.
The end result is that the studio or production company typically stores a potentially large number of versions of a film, although in many cases each version differs only slightly from the original, as measured by the overall percentage of similarity between the file(s) and the original. For full-length feature films, each such version can occupy 250-500 GB or more of storage. In addition, any given studio may produce or otherwise own hundreds or thousands of different movies altogether, each with its own set of variations. Current libraries used to store these types of files often include over 200,000 moving pictures. As a result, these production and studio companies often store hundreds of thousands of copies of the movies in multiple locations in order to operate. These requirements can translate to exabytes of data storage and exorbitant storage costs.
These and other limitations are addressed in the present disclosure.
In an aspect of the disclosure, a deduplication method, a processing system, and a computer readable medium for storing moving picture files are provided.
In one aspect of the disclosure, a deduplication method includes storing, in a memory, first data representing a moving picture, determining a data subset comprising differences between second data representing another version of the moving picture and the first data, and storing the data subset in the memory, wherein the data subset and the first data are sufficient to reproduce the second data.
In another aspect of the disclosure, a processing system includes a memory configured to store first data representing a moving picture, and one or more processors configured to identify second data representing another version of the moving picture, determine a data subset comprising differences between the first data and the second data such that the second data is reproducible using at least the data subset and the first data and without the second data, and store the data subset in the memory.
In another aspect of the disclosure, a computer readable medium includes code that, when executed by one or more processors, causes the one or more processors to store, in a memory, first data representing a moving picture, identify second data representing another version of the moving picture, determine a data subset comprising differences between the first data and the second data such that the second data is reproducible using at least the data subset and the first data and without the second data, and store the data subset in the memory.
Additional advantages and novel features will be set forth in part in the description that follows, and in part will become more apparent to those skilled in the art upon examination of the following or upon learning by practice thereof.
The detailed description set forth below in connection with the appended drawings is intended as a description of various configurations and is not intended to represent the only configurations in which the concepts described herein may be practiced. The detailed description includes specific details for the purpose of providing a thorough understanding of various concepts. However, it will be apparent to those skilled in the art that these concepts may be practiced without these specific details. In some instances, well known structures and components are shown in block diagram form in order to avoid obscuring such concepts.
Several aspects of systems for data transfer will now be presented with reference to various apparatus and methods. These apparatus and methods will be described in the following detailed description and illustrated in the accompanying drawings by various blocks, modules, components, circuits, steps, processes, algorithms, etc. (collectively referred to as “elements”). These elements may be implemented using electronic hardware, computer software, or any combination thereof. Whether such elements are implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system.
By way of example, an element, or any portion of an element, or any combination of elements may be implemented with a “processing system” that includes one or more processors. Examples of processors include microprocessors, microcontrollers, digital signal processors (DSPs), field programmable gate arrays (FPGAs), programmable logic devices (PLDs), state machines, gated logic, discrete hardware circuits, and other suitable hardware configured to perform the various functionality described throughout this disclosure. One or more processors in the processing system may execute software. Software shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, executables, threads of execution, procedures, functions, etc., whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise.
Accordingly, in one or more exemplary embodiments, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or encoded as one or more instructions or code on a computer-readable medium. Computer-readable media includes computer storage media. Storage media may be any available media that can be accessed by a computer. By way of example, and not limitation, such computer-readable media can comprise a random-access memory (RAM), a read-only memory (ROM), an electrically erasable programmable ROM (EEPROM), compact disk ROM (CD-ROM) or other optical disk storage, magnetic disk storage or other magnetic storage devices, combinations of the aforementioned types of computer-readable media, or any other medium that can be used to store computer executable code in the form of instructions or data structures that can be accessed by a computer.
Block 102 represents a source node, such as the apparatus 800 for transferring files in
As illustrated in
In the above illustration, there is an original version of F1 (104) and a copy of F1 (114) where the copy was either transferred directly from source node 102, was reproduced from MAP1 and other data or was made available by some other means.
The technique of
Using an appropriate algorithm as illustrated below, large files may be represented as one or more maps of fingerprints where the map is comparatively smaller than the original file, where the fingerprint map may be used to identify similarities within files.
A file in general consists of any number of bytes that can be represented by values in the range {00h . . . FFh}. If a subset of byte patterns can be identified whose probability of random occurrence is very small and it is ensured that the patterns occur once, such patterns may be used as fingerprints of the file. A fingerprint is similar to a checksum in this regard, except that it is not computed but created from the file itself.
Four criteria for a statistically qualifying fingerprint include (i) minimal frequency of occurrence of the fingerprint within the data stream, (ii) minimal frequency of occurrence of the fingerprint within the fingerprint map, (iii) a high entropy (i.e., highly random bits), and (iv) low collision probability. In the context of the 128-bit fingerprinting algorithm discussed in greater detail below, the following general process may be used to gather the fingerprint map of a stream:
1. Divide the stream into fragments, which gives the sample window length.
2. Sample a 128 bit packet in the window of data, which represents the 128 bit candidate fingerprint.
3. Check the 128 bit candidate fingerprint for minimum entropy.
4. Check the 128 bit candidate fingerprint for low frequency.
5. If the conditions are satisfied and the fingerprint meets a quality threshold, shift the window by 128 bits and repeat from step 2.
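The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the candidate is taken as 16 contiguous bytes (the disclosure also describes building a fingerprint from sampled bytes and coordinates), the 256-byte window and the function names are hypothetical, and the entropy test uses the octet-frequency measure with a threshold of 4 that this disclosure discusses in connection with minimum entropy.

```python
from collections import Counter

FP_LEN = 16      # a 128-bit fingerprint is 16 bytes
MAX_REPEAT = 4   # assumed threshold: no octet may appear more than 4 times


def octet_repeat(candidate: bytes) -> int:
    """Highest number of times any single octet appears in the candidate."""
    return max(Counter(candidate).values())


def occurs_once(stream: bytes, candidate: bytes) -> bool:
    """True if the candidate appears exactly once (overlapping matches count)."""
    first = stream.find(candidate)
    return first != -1 and stream.find(candidate, first + 1) == -1


def select_fingerprints(stream: bytes, window: int = 256) -> list:
    """Return (offset, fingerprint) pairs, at most one per sample window."""
    chosen = []
    for start in range(0, len(stream) - FP_LEN + 1, window):   # divide into windows
        end = min(start + window, len(stream)) - FP_LEN + 1
        for off in range(start, end):                           # sample a candidate
            cand = stream[off:off + FP_LEN]
            if octet_repeat(cand) > MAX_REPEAT:                 # entropy check
                continue
            if not occurs_once(stream, cand):                   # frequency check
                continue
            chosen.append((off, cand))                          # accept, next window
            break
    return chosen


if __name__ == "__main__":
    sample = bytes(range(256)) + bytes(reversed(range(256)))
    for offset, fp in select_fingerprints(sample):
        print(offset, fp.hex())
```

Because every accepted candidate occurs exactly once in the stream, no two accepted fingerprints can be equal, so the sketch needs no separate check against the map. A window that yields no qualifying candidate is simply skipped here.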
The bytes 906 and coordinates 908 are placed into a candidate 128-bit fingerprint 910. The candidate fingerprint 910 is then matched against any other corresponding fingerprints selected to date, as shown in pattern 912. In one configuration, a determination of whether a match is present is made at each byte boundary of the original pattern. If a match is found, the candidate fingerprint 910 is discarded and the next 256 byte section of the pattern is analyzed.
Generally, it can be shown that assuming a “normal” distribution of bits in a sample of any size, the probability of occurrence of any given fingerprint is the same, and that probability depends only on the size of the sample. For example, the probability of a 128 bit fingerprint occurring once in a one Terabyte random sample is about 1 in five hundred million. While the presence of a “normal” distribution of bits is rare in practice with everyday files, matching fingerprints may be eliminated from the map by checking the file for fingerprint repetitions as described above. Once the distribution of a 256 byte sample is analyzed, the byte spectrum may be taken as the basis to select the fingerprint that would have the lowest probability of occurrence if the file had the same distribution as the block sample.
Thus, in the embodiment of
In sum, the 128-bit (16-byte) fingerprints are constructed by sampling 256-byte continuous blocks from the file. The coordinates 908 in
The offset at which the fingerprint is found marks the beginning of the link, and the next fingerprint will mark the link's end (1016). In this manner, each link is assigned a new fingerprint and each new candidate fingerprint is ensured not to match any previous one, which in turn ensures that each fingerprint occurs only once in the file. In addition to the fingerprint, each link's size and offset are also stored along with its SHA-2 message digest in the fingerprint map (1018). Continuing this technique, an array of fingerprints of the file is obtained (1020). The number of fingerprints may be determined by the number of divisions of the file and the byte entropy of the file. Generally, this number is no more than the granularity selected. The granularity, in turn, may be the predetermined average number of divisions, which in one embodiment is at a minimum ten times the number of expected file changes. The total fingerprint map FM in one embodiment is the total number of fingerprints, array hashes, message digests and link offsets of the original file. The map is sufficient to compute the Δ between the file and any other file, as described further below.
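As a rough illustration of what one entry of the fingerprint map might hold, the sketch below records a fingerprint, a link offset, a link size, and a digest for each link. The `Link` class, the `build_map` function, and the choice of SHA-256 as the SHA-2 variant are assumptions made for this example; the disclosure does not specify a layout.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    fingerprint: bytes  # 16-byte fingerprint that opens the link
    offset: int         # where the link starts in the file
    size: int           # link length in bytes
    digest: bytes       # SHA-256 (one SHA-2 variant) of the link's bytes


def build_map(data: bytes, fp_offsets: list, fp_len: int = 16) -> list:
    """Turn sorted fingerprint offsets into links that each run to the next fingerprint."""
    links = []
    bounds = list(fp_offsets) + [len(data)]   # the last link ends at end of file
    for start, end in zip(bounds, bounds[1:]):
        body = data[start:end]
        links.append(Link(
            fingerprint=data[start:start + fp_len],
            offset=start,
            size=end - start,
            digest=hashlib.sha256(body).digest(),
        ))
    return links


if __name__ == "__main__":
    for link in build_map(bytes(range(200)), [0, 64, 128]):
        print(link.offset, link.size, link.fingerprint.hex())
```

Each entry costs 16 bytes of fingerprint, 32 bytes of digest, and two integers, regardless of how long the link is, which is why the map stays small relative to the file.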
In an embodiment, the fingerprint map is computed using an entropy equation characterized by an alphabet of N letters, whereby for each data chunk a probability is determined that a K length word has less than L number of repeating letters. For a fingerprint of 16 bytes, this problem is solved for L=1 to 16, with K=16 and N=256. The data chunk length may vary depending on the boundary of each identified fingerprint.
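One way to make this calculation concrete is sketched below. It rests on an assumed reading of the sentence above, namely that the quantity of interest is the probability that no letter appears more than a given number of times in a uniformly random K-letter word over N letters; the disclosure phrases the bound as "less than L", so the `limit` argument here corresponds to L minus one under that reading. The function name is hypothetical. The count of qualifying words is K! times the coefficient of x^K in (1 + x + x^2/2! + ... + x^limit/limit!)^N.

```python
from fractions import Fraction
from math import factorial


def prob_max_repeat_at_most(limit: int, k: int = 16, n: int = 256) -> Fraction:
    """P(no letter appears more than `limit` times) for a random k-letter word over n letters."""
    # Per-letter exponential generating function, truncated at degree `limit`.
    base = [Fraction(1, factorial(j)) for j in range(limit + 1)]
    poly = [Fraction(1)] + [Fraction(0)] * k   # start from the constant polynomial 1
    for _ in range(n):                         # multiply in one letter at a time
        nxt = [Fraction(0)] * (k + 1)
        for i, coeff in enumerate(poly):
            if coeff:
                for j in range(min(limit, k - i) + 1):
                    nxt[i + j] += coeff * base[j]
        poly = nxt
    # Coefficient of x^k times k! counts the words; divide by all n^k words.
    return poly[k] * factorial(k) / Fraction(n) ** k


if __name__ == "__main__":
    for limit in range(1, 5):
        print(limit, float(prob_max_repeat_at_most(limit)))
```

Exact rational arithmetic is used so that the results for different limits can be compared without rounding error.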
In one embodiment, the minimum entropy is defined as the frequency of occurrence of any octet at an 8 bit boundary in the 128 bit fingerprint (128 bit sequence). This frequency of occurrence may be designated as a number in the range 1-16 such that the higher the number, the lower the entropy and the lower the worth of a fingerprint, and vice versa. If the frequency number of a candidate fingerprint is greater than the minimum-entropy threshold, it is not a good candidate and another candidate is selected. Based on real world data in the experience of the inventors, the best entropy minimum appears to be around 4 in the 1-16 number range. The given criteria can be mathematically verified based on the minimum entropy and assuming a normal random distribution of bits in a stream. The probabilities of occurrence for the different minimum entropies are set forth in the following tables:
In some situations, acceptable approximations may be made. Finding quality fingerprints cannot always be guaranteed as it is highly dependent upon the byte entropy of the file. For example, there may be large sections of a file consisting of only a few different bytes, and long sequences of repeating bytes. In many embodiments, the 128 bit fingerprint is strong down to a minimum of 4 different bytes in the maximum separation range of 256 bytes. At this or greater separation range, however, the data may become extremely compressible. Thus, while fingerprints become progressively more difficult to find in the file, the file's compressibility may increase by orders of magnitude. In the event the fingerprinting algorithm identifies this situation, in one configuration, the algorithm may simply skip analysis of the entire link at issue because the link does not meet the minimum entropy requirement. Thereupon, the link may be marked as “compressible” and its byte statistics may be stored in the FM (MAP1 in
It is assumed for the purposes of this illustration that a number of changes or updates are made to F1, whether at source node 102 or otherwise, to produce file F2 as reflected at block 108. As is often the case, updated file F2 may contain a large amount of redundancy. The redundancies may also consume a comparatively high amount of the overall file space. Examples of redundancies may include images, color graphics, and blocks of identical text. The greater the file size and the greater the number of redundancies, the greater the savings of bandwidth that can be achieved. Referring back to
In one embodiment, Δ1 is obtained by first checking F2 at each byte offset of the file for matches with the fingerprints of MAP1.
While the technique described in this embodiment does not guarantee this conclusion to a certainty (such a guarantee would generally require a byte-by-byte comparison), the conclusion is likely in view of the very small collision chance of the SHA-2 digest. Furthermore, the chance that a SHA-2 collision occurs while the fingerprints themselves match is even smaller.
In the event that the sha2 digests do not match, the search for matching fingerprints continues, and all bytes that are processed are marked as new data (1110). Also, if an adjacent fingerprint does not match, all bytes processed continue to be marked as new data. After processing the remainder of F2 in this manner, the changes between F1 and F2 are identified and the unmodified links are discovered. The sum of the new data and the unmodified links is Δ, which can be applied to MAP1 to obtain F2.
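The scan described above can be sketched as follows. This is a simplified illustration under stated assumptions rather than the disclosed algorithm: the map is built from fixed 64-byte links whose first 16 bytes serve as the fingerprint, SHA-256 stands in for the SHA-2 digest, and the `("copy", ...)` and `("new", ...)` operation names are hypothetical. The delta is applied to F1 to reproduce F2.

```python
import hashlib

FP_LEN = 16  # 128-bit fingerprint


def build_map(f1: bytes, link_size: int = 64) -> dict:
    """Simplified MAP1: fingerprint -> (offset, size, SHA-256 digest) per link."""
    fmap = {}
    for off in range(0, len(f1), link_size):
        link = f1[off:off + link_size]
        fp = link[:FP_LEN]
        if len(fp) == FP_LEN and fp not in fmap:   # keep each fingerprint unique
            fmap[fp] = (off, len(link), hashlib.sha256(link).digest())
    return fmap


def compute_delta(f2: bytes, fmap: dict) -> list:
    """Check F2 at each byte offset for a fingerprint, then confirm by digest."""
    delta, new, pos = [], bytearray(), 0
    while pos < len(f2):
        hit = fmap.get(f2[pos:pos + FP_LEN])
        if hit:
            off, size, digest = hit
            if hashlib.sha256(f2[pos:pos + size]).digest() == digest:
                if new:                             # flush pending new data
                    delta.append(("new", bytes(new)))
                    new = bytearray()
                delta.append(("copy", off, size))   # unmodified link of F1
                pos += size
                continue
        new.append(f2[pos])                         # no match: byte is new data
        pos += 1
    if new:
        delta.append(("new", bytes(new)))
    return delta


def apply_delta(f1: bytes, delta: list) -> bytes:
    """Rebuild F2 from F1 plus the delta."""
    out = bytearray()
    for op in delta:
        out += f1[op[1]:op[1] + op[2]] if op[0] == "copy" else op[1]
    return bytes(out)


if __name__ == "__main__":
    f1 = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(32))
    f2 = f1[:200] + b"INSERTED" + f1[200:]
    delta = compute_delta(f2, build_map(f1))
    print(len(delta), apply_delta(f1, delta) == f2)
```

Because the scan advances one byte at a time when nothing matches, an insertion that shifts later content does not prevent the following links from being recognized at their new offsets.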
In an embodiment, during the delta processing where certain portions of F2 are found to be compressible, they may be encoded with a run-length-encoded (RLE) compression formula, which can achieve an extremely high compression ratio. The degree of compressibility may be high in view of the fact that the byte entropy may be required to get very low before the fingerprinting fails for that portion. For example, a repeating pattern of 1 Kbyte may be compressed at a ratio of 100:1 by RLE, and the compressed data is then added to Δ1. The described algorithms are easily tunable and scale well for multiple CPU cores. The fingerprint mapping may be computed simultaneously with the delta or from the backup copy of the file. How and when the delta is applied depends on the situation and may be implemented separately depending on the application's needs. The minimum link size and the fingerprint range are tunable to accommodate smaller files, but the algorithm is well suited for files greater than 1 Mb. In one configuration, the default and maximum fingerprint range is 256 bytes, which is the maximum distance between two fingerprint bytes. Compression is optional but for sparse files it may add significant additional benefits for accelerating file transfer operations.
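For readers unfamiliar with run-length encoding, a minimal sketch follows. The (count, value) byte-pair layout and the cap of 255 bytes per run are assumptions chosen for brevity; the disclosure does not specify an RLE format.

```python
def rle_encode(data: bytes) -> bytes:
    """Encode as (count, value) byte pairs; each run is capped at 255 bytes."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes((run, data[i]))
        i += run
    return bytes(out)


def rle_decode(encoded: bytes) -> bytes:
    """Expand (count, value) pairs back into the original bytes."""
    out = bytearray()
    for k in range(0, len(encoded), 2):
        out += bytes([encoded[k + 1]]) * encoded[k]
    return bytes(out)


if __name__ == "__main__":
    block = b"\x00" * 1024
    packed = rle_encode(block)
    print(len(block), len(packed), rle_decode(packed) == block)
```

Under this particular layout, a 1024-byte run of a single value encodes to five pairs (four runs of 255 and one of 4), or 10 bytes. Data with few repeats would grow rather than shrink, which is one reason to apply RLE only to links already marked as compressible.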
Referring again back to
At this point in time, destination node 112 has the Δ1 and F1 (or MAP1). Thereupon, destination node 112 is able to rapidly compute F2 based on a standard application of Δ1 to F1 (block 118).
A number of alternative embodiments may be contemplated in view of
In
Whether any given file is a candidate for delta compression as described herein may be determined using any of a variety of methods at the source node. For example, certain fields or metadata corresponding to a document may be retrieved at the source node to determine whether the file has been identified as an updated or modified version of an existing file. In another configuration, the document title may be provided in an identifiable format. Alternatively, a basic comparison of the content of a candidate file may be made with one or more existing files to determine its suitability for delta compression.
In another embodiment where a file is subsequently updated a number of times, the source node may only maintain and save an FM of the original file, and may reuse the FM and recompute respective deltas for each subsequent modified version of the file. Alternatively, the source node may create a new FM of each iteration of the file.
By using the techniques disclosed herein, substantial bandwidth savings can be achieved since only the changes to documents need to be transmitted over the network to a destination. The FM may be maintained in memory at the source node for use in computing deltas corresponding to future modifications.
Further, unlike the embodiment of
In alternative embodiments, MAP1 may be generated on the fly at the destination node and transferred to the source node, such as, for example, in response to determining that F2 at the source node and F2 at the destination node contain differences. Thereupon, Δ is calculated (block 310) using MAP1 and F2 (block 310). The Δ is then transmitted via network 320 to destination node 312, where it is used along with F1 to reproduce F2. Subsequently, where a third file is received that is a modified version of the second, the source node 302 or destination node 312 may create MAP2 on the fly corresponding to that file for subsequent delta operations (block 311).
In addition to the advantages associated with the embodiment of
In
Referring to
In alternative embodiments to those shown in
In
A2 represents one or more separate applications from which the data constituting the files are obtained. A2 may reside on the same machine or a different machine to that of source host 702. The “P/S” indication shows that the data may be sent via a pipe or socket connection to application A1, or another connection type depending on the physical configuration employed.
In another aspect of the disclosure, a technique for raw data differencing is disclosed, which can accelerate the transfer of raw data across a network. One use of the technique described in this implementation is an exemplary file transfer engine such as that described in connection with
In
As new files (e.g. D6, D7, D8) from stream D are generated and stored in the history cache, host 702 compares, via pipe 1208 or a similar interface, the new files to the fingerprint maps (e.g., FM1-FM5) stored in the history cache. If a match is detected, such as if packets in the file are detected to be identical to those represented in one of the stored fingerprint maps, the host 702 transmits an indication to the destination that the file (or particular sections thereof) is already present at the destination and also may send a pointer identifying the file or sections.
As an illustration involving a transfer of data and a roughly 1 TB history cache 1204, as a first 100 MB of data arrives at the host 702, the history cache 1204 may be empty. After 1 TB of data is received, the host may create a 1 TB file constituting the data and a fingerprint map of the file. The file and fingerprint map may be stored in the history cache at the host. As additional data is received beyond the 1 TB, the data may be identified to correspond with one of the stored maps, whereupon the source 702 sends an appropriate indication over the network.
It is understood that the specific order or hierarchy of blocks in the processes/flow charts disclosed is an illustration of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of blocks in the processes/flow charts may be rearranged. Further, some blocks may be combined or omitted. The accompanying method claims present elements of the various blocks in a sample order, and are not meant to be limited to the specific order or hierarchy presented.
In other aspects of the disclosure, deduplication techniques for storage of moving picture data are presented. For the purposes of this disclosure, a “moving picture” broadly includes all forms of video and media files, motion pictures, film, television, video clips, shorts, interactive graphics, moving images, animation, and the like. While embodiments in this disclosure are described for clarity in the illustrative case of multiple versions of a motion picture, the concepts disclosed and claimed herein broadly encompass applications relating to all forms of moving pictures as defined above. Data deduplication refers broadly to data compression techniques for eliminating duplicate copies of repeating data.
With reference to the problem of storage inefficiencies arising from the need to store multiple versions of the same motion picture, approaches are presented herein for substantially reducing storage requirements. As noted above, many such versions involve minor differences including, for example, deletion of a promiscuous scene to change a film's rating (e.g., for distribution to an airline), changing of title or credit information, and the like. In lieu of storing multiple versions, a file can be generated to include only a data subset representing a difference between an original moving picture and another version thereof. The other version is represented as a difference in a small file. Only the original moving picture and the difference file(s) are stored. To reproduce the other version, the file is “differenced” with the original, using any of a plurality of techniques including without limitation those described in this disclosure.
Using these techniques, a studio or distribution company need only store one master of the film and one or more difference files representing different versions of the file. Each version of the movie may be reproduced using the difference file(s) and the master. The inventors have estimated that content distribution companies like Netflix® can save on the order of $100 million a year in storage costs. Similar companies in the content distribution space and the satellite and cable business stand to achieve comparable savings.
Referring back to
Differencing files D1-D3 and DT include data that represents the differences in, additions to, and/or deletions from, the original moving picture 1302. In most instances, such files represent a very small percentage of the respective versions MP1-MP3. Having computed and generated the differencing files, the processing system stores D1-D3 (or alternatively, DT) in the memory 1304 for “long term” storage. The original moving picture files MP1-MP3 need not be preserved. In some embodiments, only a fingerprint map(s) of the original moving picture 1302 is stored in memory 1304. In some embodiments, fingerprint maps of one or more of the original moving picture files 1302 and/or files MP1-MP3 (1312, 1314, 1316) may be created, stored in memory 1304, and/or used to facilitate generation of the differencing file(s), as contemplated previously in this disclosure.
In other embodiments, the processing system may physically constitute two or more separate systems implementing different functions relevant to the above-described steps. For example, one part of the processing system may handle the generation of the differencing data, while another part of the processing system may handle the receiving of files and data and the storage of the files and data in memory, as well as the retrieval therefrom.
Subsequently, to reproduce the versions MP1-MP3 (1306, 1308, 1310) based on the content of memory 1304, the processing system uses the original file FO and the differencing file D1 (1320) (using the techniques described herein, for example) to reproduce version MP1 (1326). The processing system uses the original file FO and the differencing file D2 (1322) to reproduce version MP2 (1328). The processing system uses the original file FO and the differencing file D3 (1322) to reproduce version MP3 (1330).
The described techniques may be implemented as one or more standalone application suites in some embodiments that have user interfaces for manipulation by a user, or they may be implemented behind the file system relative to the perspective of a user. In one implementation of the interface for end users, the engine for differencing and storing different versions of an original moving picture can be hidden inside a user space file system. The process in this embodiment is transparent to the end user and integrates seamlessly with other capabilities of the processing system. In this embodiment, the user will not need to do anything new in order to use the benefits of this technique other than mount to and read from another device (file system). In another embodiment, the differencing method is non-block aligned.
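The store-then-reproduce flow can be illustrated with a small sketch. Note the substitution: this example uses the generic sequence matcher from Python's standard `difflib` module in place of the fingerprint-based differencing described in this disclosure, and it operates on short byte strings rather than real media files. The `make_diff` and `reproduce` names and the operation tuples are hypothetical.

```python
from difflib import SequenceMatcher


def make_diff(master: bytes, version: bytes) -> list:
    """Difference file as ops: ('copy', start, end) from the master, or ('data', bytes)."""
    ops = []
    matcher = SequenceMatcher(None, master, version, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif j2 > j1:   # 'replace' or 'insert': carry the version's own bytes
            ops.append(("data", version[j1:j2]))
        # 'delete' needs no op: those master bytes are simply never copied
    return ops


def reproduce(master: bytes, ops: list) -> bytes:
    """Rebuild a version from the master and its difference file."""
    out = bytearray()
    for op in ops:
        out += master[op[1]:op[2]] if op[0] == "copy" else op[1]
    return bytes(out)


if __name__ == "__main__":
    fo = b"TITLE|scene1|scene2|scene3|CREDITS"
    mp1 = b"TITLE|scene1|scene3|CREDITS"            # a scene removed
    mp2 = b"ALT TITLE|scene1|scene2|scene3|CREDITS"  # changed title
    for version in (mp1, mp2):
        d = make_diff(fo, version)
        print(d, reproduce(fo, d) == version)
```

Only the master and the operation lists need to be kept; each version is regenerated on demand, which mirrors the roles of FO and D1-D3 above.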
Other embodiments also may not rely on a single original master and may instead store more than one file, each of which may be used to recreate other versions. For the purposes of this disclosure, a “version” broadly refers to a file that includes at least in part the substance of another file.
The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein, but are to be accorded the full scope consistent with the language of the claims, wherein reference to an element in the singular is not intended to mean “one and only one” unless specifically so stated, but rather “one or more.” The word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any aspect described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects. Unless specifically stated otherwise, the term “some” refers to one or more. Moreover, nothing disclosed herein is intended to be dedicated to the public regardless of whether such disclosure is explicitly recited in the claims. No claim element is to be construed as a means plus function unless the element is expressly recited using the phrase “means for.”