This disclosure relates to data storage, retrieval, and communication. More specifically, this disclosure relates to efficient retrieval and reconstitution of data that has been losslessly reduced using a prime data sieve.
The modern information age is marked by the creation, capture, and analysis of enormous amounts of data. New data is generated from diverse sources, examples of which include purchase transaction records, corporate and government records and communications, email, social media posts, digital pictures and videos, machine logs, signals from embedded devices, digital sensors, cellular phone global positioning satellites, space satellites, scientific computing, and the grand challenge sciences. Data is generated in diverse formats, and much of it is unstructured and unsuited for entry into traditional databases. Businesses, governments, and individuals generate data at an unprecedented rate and struggle to store, analyze, and communicate this data. Tens of billions of dollars are spent annually on purchases of storage systems to hold the accumulating data. Similarly large amounts are spent on computer systems to process the data.
In most modern computer and storage systems, data is accommodated and deployed across multiple tiers of storage, organized as a storage hierarchy. The data that is needed to be accessed often and quickly is placed in the fastest albeit most expensive tier, while the bulk of the data (including copies for backup) is preferably stored in the densest and cheapest storage medium. The fastest and most expensive tier of data storage is the computer system's volatile random access memory or RAM, residing in close proximity to the microprocessor core, and offering the lowest latency and the highest bandwidth for random access of data. Progressively denser and cheaper but slower tiers (with progressively higher latency and lower bandwidth of random access) include non-volatile solid state memory or flash storage, hard disk drives (HDDs), and finally tape drives.
In order to more effectively store and process the growing data, the computer industry continues to make improvements to the density and speed of the data storage medium and to the processing power of computers. However, the increase in the volume of data far outstrips the improvement in capacity and density of the computing and data storage systems. Statistics from the data storage industry in 2014 reveal that new data created and captured in the past couple of years comprises a majority of the data ever captured in the world. The amount of data created in the world to date is estimated to exceed multiple zettabytes (a zettabyte is 10^21 bytes). The massive increase in the data places great demands on data storage, computing, and communication systems that must store, process, and communicate this data reliably. This motivates the increased use of lossless data reduction or compression techniques to compact the data so that it can be stored at reduced cost, and likewise processed and communicated efficiently.
A variety of lossless data reduction or compression techniques have emerged and evolved over the years. These techniques examine the data to look for some form of redundancy in the data and exploit that redundancy to realize a reduction of the data footprint without any loss of information. For a given technique that looks to exploit a specific form of redundancy in the data, the degree of data reduction achieved depends upon how frequently that specific form of redundancy is found in the data. It is desirable that a data reduction technique be able to flexibly discover and exploit any available redundancy in the data. Since data originates from a wide variety of sources and environments and in a variety of formats, there is great interest in the development and adoption of universal lossless data reduction techniques to handle this diverse data. A universal data reduction technique is one which requires no prior knowledge of the input data other than the alphabet; hence, it can be applied generally to any and all data without needing to know beforehand the structure and statistical distribution characteristics of the data.
Goodness metrics that can be used to compare different implementations of data compression techniques include the degree of data reduction achieved on the target datasets, the efficiency with which the compression or reduction is achieved, and the efficiency with which the data is decompressed and retrieved for further use. The efficiency metrics assess the performance and cost-effectiveness of the solution. Performance metrics include the throughput or ingest rate at which new data can be consumed and reduced, the latency or time required to reduce the input data, the throughput or rate at which the data can be decompressed and retrieved, and the latency or time required to decompress and retrieve the data. Cost metrics include the cost of any dedicated hardware components required, such as the microprocessor cores or the microprocessor utilization (central processing unit utilization), the amount of dedicated scratch memory and memory bandwidth, as well as the number of accesses and bandwidth required from the various tiers of storage that hold the data. Note that reducing the footprint of the data while simultaneously providing efficient and speedy compression as well as decompression and retrieval has the benefit not only of reducing the overall cost to store and communicate the data but also of efficiently enabling subsequent processing of the data.
Many of the universal data compression techniques currently being used in the industry derive from the Lempel-Ziv compression method developed in 1977 by Abraham Lempel and Jacob Ziv; see, e.g., Jacob Ziv and Abraham Lempel, “A Universal Algorithm for Sequential Data Compression,” IEEE Transactions on Information Theory, Vol. IT-23, No. 3, May 1977. This method became the basis for enabling efficient data transmission via the Internet. The Lempel-Ziv methods (named LZ77, LZ78 and their variants) reduce the data footprint by replacing repeated occurrences of a string with a reference to a previous occurrence seen within a sliding window of a sequentially presented input data stream. On consuming a fresh string from a given block of data from the input data stream, these techniques search through all strings previously seen within the current and previous blocks up to the length of the window. If the fresh string is a duplicate, it is replaced by a backward reference to the original string. If the number of bytes eliminated by the duplicate string is larger than the number of bytes required for the backward reference, a reduction of the data has been achieved. To search through all strings seen in the window, and to provide maximal string matching, implementations of these techniques employ a variety of schemes, including iterative scanning and building a temporary bookkeeping structure that contains a dictionary of all the strings seen in the window. Upon consuming new bytes of input to assemble a fresh string, these techniques either scan through all the bytes in the existing window, or make references to the dictionary of strings (followed by some computation) to decide whether a duplicate has been found and to replace it with a backward reference (or, alternatively, to decide whether an addition needs to be made to the dictionary).
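As a minimal sketch of the sliding-window matching described above, the following Python fragment performs a greedy, brute-force scan of the window and emits either literal bytes or (distance, length) backward references. The function names, the token format, and the default parameter values are illustrative assumptions rather than any standardized encoding; practical implementations typically replace the inner scan with a hashed dictionary of previously seen strings.

```python
def lz77_compress(data: bytes, window: int = 4096, min_len: int = 3, max_len: int = 255):
    """Greedy LZ77-style tokenizer (illustrative; parameter defaults are assumptions)."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Scan every candidate start position inside the sliding window.
        for j in range(max(0, i - window), i):
            k = 0
            while k < max_len and i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_dist = k, i - j
        if best_len >= min_len:
            # A duplicate long enough to be worth a backward reference.
            tokens.append(("ref", best_dist, best_len))
            i += best_len
        else:
            tokens.append(("lit", data[i]))
            i += 1
    return tokens


def lz77_decompress(tokens) -> bytes:
    """Rebuild the input by replaying literals and backward references."""
    buf = bytearray()
    for token in tokens:
        if token[0] == "lit":
            buf.append(token[1])
        else:
            _, dist, length = token
            for _ in range(length):
                buf.append(buf[-dist])  # byte-wise copy allows overlapping matches
    return bytes(buf)


sample = b"abcabcabcabc-abcabc"
tokens = lz77_compress(sample)
restored = lz77_decompress(tokens)
```

A reduction is achieved only when a reference token costs fewer bytes than the string it replaces, which is why the sketch refuses matches shorter than `min_len`.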
Another noteworthy method in the prior art re-encodes data in a block or a message based on its entropy to achieve compression. In this method, source symbols are dynamically re-encoded based upon their frequency or probability of occurrence in the data block being compressed, often employing a variable-width encoding scheme so that shorter length codes are used for the more frequent symbols, thus leading to a reduction of the data. For an example of such an entropy-based re-encoding method, see David A. Huffman, “A Method for the Construction of Minimum-Redundancy Codes,” Proceedings of the IRE—Institute of Radio Engineers, September 1952, pp. 1098-1101. This technique is referred to as Huffman re-encoding, and typically needs a first pass through the data to compute the frequencies and a second pass to actually encode the data. Several variations along this theme are also in use.
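The two-pass structure described above can be sketched as follows, assuming byte-valued symbols and a code table held as a Python dictionary of bit strings. The function names are hypothetical, and the bit strings are kept as text for readability rather than packed into bytes.

```python
import heapq
from collections import Counter


def huffman_code(data: bytes) -> dict:
    """First pass: count symbol frequencies and build a prefix-free code."""
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tie_breaker, {symbol: code_so_far}).
    heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]


def huffman_encode(data: bytes, code: dict) -> str:
    """Second pass: re-encode each symbol with its variable-width code."""
    return "".join(code[b] for b in data)


def huffman_decode(bits: str, code: dict) -> bytes:
    reverse = {c: s for s, c in code.items()}
    out, current = bytearray(), ""
    for bit in bits:
        current += bit
        if current in reverse:
            out.append(reverse[current])
            current = ""
    return bytes(out)


text = b"aaaaaaaabbbc"
code = huffman_code(text)
bits = huffman_encode(text, code)
```

In this example the most frequent symbol receives a 1-bit code and the two rarer symbols receive 2-bit codes, so the 12 input bytes are represented in 16 bits, not counting the cost of conveying the code table.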
One example that uses these techniques is a scheme known as “Deflate” which combines the Lempel-Ziv LZ77 compression method with Huffman re-encoding. Deflate provides a compressed stream data format specification that specifies a method for representing a sequence of bytes as a (usually shorter) sequence of bits, and a method for packing the latter bit sequences into bytes. The Deflate scheme was originally designed by Phillip W. Katz of PKWARE, Inc. for the PKZIP archiving utility. See e.g., “String searcher, and compressor using same,” Phillip W. Katz, U.S. Pat. No. 5,051,745, Sep. 24, 1991. U.S. Pat. No. 5,051,745 describes a method for searching a vector of symbols (the window) for a predetermined target string (the input string). The solution employs a pointer array with a pointer to each of the symbols in the window, and uses a method of hashing to filter the possible locations in the window that are required to be searched for an identical copy of the input string. This is followed by scanning and string matching at those locations.
The Deflate scheme is implemented in the zlib library for data compression. Zlib is a software library that is a key component of several software platforms such as Linux, Mac OS X, iOS, and a variety of gaming consoles. The zlib library provides Deflate compression and decompression code for use by zip (file archiving), gzip (single file compression), png (Portable Network Graphics format for losslessly compressed images), and many other applications. Zlib is now widely used for data transmission and storage. Most HTTP transactions by servers and browsers compress and decompress the data using zlib. Similar implementations are increasingly being used by data storage systems.
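For illustration, the zlib library is exposed in Python through the standard `zlib` module, and a round trip at two compression levels can be sketched as follows. The payload is an arbitrary, highly repetitive example chosen for this sketch, so the sizes it produces say nothing about typical data.

```python
import zlib

# Arbitrary repetitive payload (an assumption made for this example).
payload = b"the quick brown fox jumps over the lazy dog\n" * 200

fast = zlib.compress(payload, 1)   # level 1: fastest
best = zlib.compress(payload, 9)   # level 9: most exhaustive matching

restored = zlib.decompress(best)

print(len(payload), len(fast), len(best))
```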
A paper entitled “High Performance ZLIB Compression on Intel® Architecture Processors,” that was published by Intel Corp. in April 2014 characterizes the compression and performance of an optimized version of the zlib library running on a contemporary Intel processor (Core i7 4770 processor, 3.4 GHz, 8 MB cache) and operating upon the Calgary corpus of data. The Deflate format used in zlib sets the minimum string length for matching to be 3 characters, the maximum length of the match to be 258 characters, and the size of the window to be 32 kilobytes. The implementation provides controls for 9 levels of optimization, with level 9 providing the highest compression but using the most computation and performing the most exhaustive matching of strings, and level 1 being the fastest level and employing greedy string matching. The paper reports a compression ratio of 51% using zlib level 1 (the fastest level) on a single thread, spending an average of 17.66 clocks/byte of input data. At a clock frequency of 3.4 GHz, this implies an ingest rate of 192 MB/sec while using up a single microprocessor core. The report also describes how the performance rapidly drops to an ingest rate of 38 MB/sec (average of 88.1 clocks/byte) using optimization level 6 for a modest gain in compression, and to an ingest rate of 16 MB/sec (average of 209.5 clocks/byte) using optimization level 9.
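The ingest rates quoted above follow directly from dividing the clock frequency by the average clocks spent per input byte. A short sketch of that arithmetic, taking 1 MB as 10^6 bytes (an assumption consistent with the quoted figures) and using a hypothetical helper name:

```python
def ingest_rate_mb_per_s(clock_hz: float, clocks_per_byte: float) -> float:
    """Bytes consumed per second by one core, expressed in MB/sec (1 MB = 10**6 bytes)."""
    return clock_hz / clocks_per_byte / 1e6


CLOCK_HZ = 3.4e9  # 3.4 GHz, as quoted in the text
clocks_per_byte = {1: 17.66, 6: 88.1, 9: 209.5}  # zlib level -> clocks/byte, from the text

rates = {level: ingest_rate_mb_per_s(CLOCK_HZ, cpb) for level, cpb in clocks_per_byte.items()}

for level, rate in rates.items():
    print(f"level {level}: {rate:.1f} MB/sec")
```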
Existing data compression solutions typically operate at ingest rates ranging from 10 MB/sec to 200 MB/sec using a single processor core on contemporary microprocessors. To further boost the ingest rate, multiple cores are employed, or the window size is reduced. Even further improvements to the ingest rate are achieved using custom hardware accelerators, albeit at increased cost.
Existing data compression methods described above are effective at exploiting fine-grained redundancy at the level of short strings and symbols in a local window typically the size of a single message or file or perhaps a few files. These methods have serious limitations and drawbacks when they are used in applications that operate on large or extremely large datasets and that require high rates of data ingestion and data retrieval.
One important limitation is that practical implementations of these methods can exploit redundancy efficiently only within a local window. While these implementations can accept arbitrarily long input streams of data, efficiency dictates that a limit be placed on the size of the window across which fine-grained redundancy is to be discovered. These methods are highly compute-intensive and need frequent and speedy access to all the data in the window. String matching and lookups of the various bookkeeping structures are triggered upon consuming each fresh byte (or few bytes) of input data that creates a fresh input string. In order to achieve desired ingest rates, the window and associated machinery for string matching must reside mostly in the processor cache subsystem, which in practice places a constraint on the window size.
For example, to achieve an ingest rate of 200 MB/sec on a single processor core, the available time budget on average per ingested byte (inclusive of all data accesses and compute) is 5 ns, which means 17 clocks using a contemporary processor with operating frequency of 3.4 GHz. This budget accommodates accesses to on-chip caches (which take a handful of cycles) followed by some string matching. Current processors have on-chip caches of several megabytes of capacity. An access to main memory takes over 200 cycles (~70 ns), so larger windows residing mostly in memory will further slow the ingest rate. Also, as the window size increases, and the distance to a duplicate string increases, so does the cost to specify the length of backward references, thus encouraging only longer strings to be searched across the wider scope for duplication.
On most contemporary data storage systems, the footprint of the data stored across the various tiers of the storage hierarchy is several orders of magnitude larger than the memory capacity in the system. For example, while a system could provide hundreds of gigabytes of memory, the data footprint of the active data residing in flash storage could be in the tens of terabytes, and the total data in the storage system could be in the range of hundreds of terabytes to multiple petabytes. Also, the achievable throughput of data accesses to subsequent tiers of storage drops by an order of magnitude or more for each successive tier. When the sliding window gets so large that it can no longer fit in memory, these techniques get throttled by the significantly lower bandwidth and higher latency of random IO (Input or Output operations) access to the next levels of data storage.
For example, consider a file or a page of 4 kilobytes of incoming data that can be assembled from existing data by making references to, say, 100 strings of average length of 40 bytes that already exist in the data and are spread across a 256 terabyte footprint. Each reference would cost 6 bytes to specify its address and 1 byte for string length while promising to save 40 bytes. Although the page described in this example can be compressed by more than fivefold, the ingest rate for this page would be limited by the 100 or more IO accesses to the storage system needed to fetch and verify the 100 duplicate strings (even if one could perfectly and cheaply predict where these strings reside). A storage system that offers 250,000 random IO accesses/sec (which means bandwidth of 1 GB/sec of random accesses to pages of 4 KB) could compress only 2,500 such pages of 4 KB size per second for an ingest rate of a mere 10 MB/sec while using up all the bandwidth of the storage system, rendering it unavailable as a storage system.
Implementations of conventional compression methods with large window sizes of the order of terabytes or petabytes will be starved by the reduced bandwidth of data access to the storage system, and would be unacceptably slow. Hence, practical implementations of these techniques efficiently discover and exploit redundancy only if it exists locally, on window sizes that fit in the processor cache or system memory. If redundant data is separated either spatially or temporally from incoming data by multiple terabytes, petabytes, or exabytes, these implementations will be unable to discover the redundancy at acceptable speeds, being limited by storage access bandwidth.
Another limitation of conventional methods is that they are not suited for random access of data. Blocks of data spanning the entire window that was compressed need to be decompressed before any chunk within any block can be accessed. This places a practical limit on the size of the window. Additionally, operations (e.g., a search operation) that are traditionally performed on uncompressed data cannot be efficiently performed on the compressed data.
Yet another limitation of conventional methods (and, in particular, Lempel-Ziv based methods) is that they search for redundancy only along one dimension—that of replacing identical strings by backward references. A limitation of the Huffman re-encoding scheme is that it needs two passes through the data, to calculate frequencies and then re-encode. This becomes slow on larger blocks.
Data compression methods that detect long duplicate strings across a global store of data often use a combination of digital fingerprinting and hashing schemes. This compression process is referred to as data deduplication. The most basic technique of data deduplication breaks up files into fixed-sized blocks and looks for duplicate blocks across the data repository. If a copy of a file is created, each block in the first file will have a duplicate in the second file and the duplicate can be replaced with a reference to the original block. To speed up matching of potentially duplicate blocks, a method of hashing is employed. A hash function is a function that converts a string into a numeric value, called its hash value. If two strings are equal, their hash values are also equal. Hash functions map multiple strings to a given hash value, whereby long strings can be reduced to a hash value of much shorter length. Matching of the hash values will be much faster than matching of two long strings; hence, matching of the hash values is done first, to filter possible strings that might be duplicates. If the hash value of the input string or block matches a hash value of strings or blocks that exist in the repository, the input string can then be compared with each string in the repository that has the same hash value to confirm the existence of the duplicate.
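A minimal sketch of the fixed-sized-block scheme described above follows. It assumes SHA-256 as the hash function and an in-memory dictionary standing in for the repository; both are choices made for this example. Matching hash values only filter candidates, and a byte-by-byte comparison confirms each duplicate.

```python
import hashlib


def dedup_fixed_blocks(data: bytes, block_size: int, store: dict) -> list:
    """Split data into fixed-sized blocks and return one (hash, index) reference per block.

    `store` maps a hash value to the list of distinct blocks that share it.
    """
    refs = []
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        h = hashlib.sha256(block).digest()
        candidates = store.setdefault(h, [])
        for idx, existing in enumerate(candidates):
            if existing == block:  # byte-by-byte confirmation of the duplicate
                break
        else:
            candidates.append(block)  # new content: enter it into the repository
            idx = len(candidates) - 1
        refs.append((h, idx))
    return refs


def rebuild(refs: list, store: dict) -> bytes:
    return b"".join(store[h][idx] for h, idx in refs)


store = {}
# Deterministic 2048-byte test file made of 64 distinct 32-byte digests.
first_file = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(64))

refs_first = dedup_fixed_blocks(first_file, 256, store)
blocks_after_first = sum(len(blocks) for blocks in store.values())

refs_copy = dedup_fixed_blocks(first_file, 256, store)  # an exact copy of the file
blocks_after_copy = sum(len(blocks) for blocks in store.values())
```

Ingesting the exact copy adds no new blocks to the store; the copy is represented entirely by references to the blocks of the first file.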
Breaking up a file into fixed-sized blocks is simple and convenient, and fixed-sized blocks are highly desirable in a high-performance storage system. However, this technique has limitations in the amount of redundancy it can uncover, which means that these techniques have low levels of compression. For example, if a copy of a first file is made to create a second file, and if even a single byte of data is inserted into the second file, the alignment of all downstream blocks will change, the hash value of each new block will be computed afresh, and the data deduplication method will no longer find all the duplicates.
To address this limitation in data deduplication methods, the industry has adopted the use of fingerprinting to synchronize and align data streams at locations of matching content. This latter scheme leads to variable-sized blocks based on the fingerprints. Michael Rabin showed how randomly chosen irreducible polynomials can be used to fingerprint a bit-string; see, e.g., Michael O. Rabin, “Fingerprinting by Random Polynomials,” Center for Research in Computing Technology, Harvard University, TR-15-81, 1981. In this scheme, a randomly chosen prime number p is used to fingerprint a long character-string by computing the residue of that string viewed as a large integer modulo p. This scheme requires performing integer arithmetic on k-bit integers, where k = log_2(p). Alternatively, a random irreducible prime polynomial of order k can be used, and the fingerprint is then the polynomial representation of the data modulo the prime polynomial.
This method of fingerprinting is used in data deduplication systems to identify suitable locations at which to establish chunk boundaries, so that the system can look for duplicates of these chunks in a global repository. Chunk boundaries can be set upon finding fingerprints of specific values. As an example of such usage, a fingerprint can be calculated for each and every 48-byte string in the input data (starting at the first byte of the input and then at every successive byte thereafter), by employing a polynomial of order 32 or lower. One can then examine the lower 13 bits of the 32-bit fingerprint, and set a breakpoint whenever the value of those 13 bits is a pre-specified value (e.g., the value 1). For random data, the likelihood of the 13 bits having that particular value would be 1 in 2^13, so that such a breakpoint is likely to be encountered approximately once every 8 KB, leading to variable-sized chunks of average size 8 KB. The breakpoints or chunk boundaries will effectively be aligned to fingerprints that depend upon the content of the data. When no fingerprint is found for a long stretch, a breakpoint can be forced at some pre-specified threshold, so that the system is certain to create chunks that are shorter than a pre-specified size for the repository. See e.g., Athicha Muthitacharoen, Benjie Chen, and David Mazières, “A Low-bandwidth Network File System,” SOSP '01, Proceedings of the eighteenth ACM symposium on Operating Systems Principles, Oct. 21, 2001, pp. 174-187.
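The breakpoint rule described above can be sketched as follows. Note the substitution: instead of a Rabin fingerprint over an irreducible polynomial, this sketch uses a simple integer polynomial hash modulo the prime 2^31 - 1, which is enough to show how content-defined boundaries behave. The function name, the base, the modulus, and the forced-breakpoint threshold are all assumptions made for the example; the defaults for the window (48 bytes), the number of low-order bits examined (13), and the target value (1) mirror the text.

```python
import random


def chunk(data: bytes, window: int = 48, mask_bits: int = 13, target: int = 1,
          max_chunk: int = 64 * 1024, base: int = 257, mod: int = (1 << 31) - 1) -> list:
    """Split data at positions where the low bits of a sliding-window hash equal `target`."""
    mask = (1 << mask_bits) - 1
    top = pow(base, window - 1, mod)  # weight of the byte that leaves the window
    chunks, last, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * top) % mod  # drop the outgoing byte
        h = (h * base + byte) % mod                 # bring in the incoming byte
        end = i + 1
        at_fingerprint = end >= window and (h & mask) == target
        if at_fingerprint or end - last >= max_chunk:  # forced breakpoint on long stretches
            chunks.append(data[last:end])
            last = end
    if last < len(data):
        chunks.append(data[last:])
    return chunks


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20000))
shifted = b"X" + original  # a single byte inserted at the front

# Smaller window and mask than the defaults, so this short input yields many chunks.
chunks_original = chunk(original, window=16, mask_bits=6)
chunks_shifted = chunk(shifted, window=16, mask_bits=6)
```

Because each boundary depends only on the bytes inside the window, the one-byte insertion disturbs only the chunk that contains it; every later chunk of `original` reappears unchanged among the chunks of `shifted` and can therefore be found as a duplicate.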
The Rabin-Karp string matching technique developed by Michael Rabin and Richard Karp provided further improvements to the efficiency of fingerprinting and string matching (see e.g., Michael O. Rabin and R. Karp, “Efficient Randomized Pattern-Matching Algorithms,” IBM Jour. of Res. and Dev., vol. 31, 1987, pp. 249-260). Note that a fingerprinting method that examines an m byte substring for its fingerprint can evaluate the fingerprinting polynomial function in O(m) time. Since this method would need to be applied on the substring starting at every byte of the, say, n byte input stream, the total effort required to perform fingerprinting on the entire data stream would be O(n×m). Rabin-Karp identified a hash function referred to as a Rolling Hash on which it is possible to compute the hash value of the next substring from the previous one by doing only a constant number of operations, independently of the length of the substring. Hence, after shifting one byte to the right, the computation of the fingerprint on the new m byte string can be done incrementally. This reduces the effort to compute the fingerprint to O(1), and the total effort for fingerprinting the entire data stream to O(n), linear with the size of the data. This greatly speeds up computation and identification of the fingerprints.
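A minimal sketch of such a rolling hash follows, assuming a polynomial hash with an arbitrarily chosen base and prime modulus (both are assumptions of this example, as are the function names). Each step removes the contribution of the outgoing byte and adds the incoming byte, so the work per position does not depend on m.

```python
def direct_hash(window_bytes: bytes, base: int = 257, mod: int = (1 << 61) - 1) -> int:
    """O(m) evaluation of the polynomial hash of one m-byte substring."""
    h = 0
    for b in window_bytes:
        h = (h * base + b) % mod
    return h


def rolling_hashes(data: bytes, m: int, base: int = 257, mod: int = (1 << 61) - 1) -> list:
    """Hash of every m-byte substring of data, computed in O(n) total."""
    if m <= 0 or len(data) < m:
        return []
    h = direct_hash(data[:m], base, mod)
    top = pow(base, m - 1, mod)  # weight of the leftmost byte in the window
    hashes = [h]
    for i in range(m, len(data)):
        # Constant work per shift: remove data[i - m], append data[i].
        h = ((h - data[i - m] * top) * base + data[i]) % mod
        hashes.append(h)
    return hashes


stream = b"the rolling hash slides one byte at a time"
m = 8
rolled = rolling_hashes(stream, m)
```

Recomputing each window from scratch with `direct_hash` gives the same values at O(n×m) total cost, which makes it a convenient check on the incremental update.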
Typical data access and computational requirements for the above-described data deduplication methods can be described as follows. For a given input, once fingerprinting is completed to create a chunk, and after the hash value for the chunk is computed, these methods first need one set of accesses to memory and subsequent tiers of storage to search and look up the global hash table that keeps the hash values of all chunks in the repository. This would typically need a first IO access to storage. Upon a match in the hash table, this is followed by a second set of storage IOs (typically one, but could be more than one depending upon how many chunks with the same hash value exist in the repository) to fetch the actual data chunks bearing the same hash value. Lastly, byte-by-byte matching is performed to compare the input chunk to the fetched potentially matching chunks to confirm and identify the duplicate. This is followed by a third storage IO access (to the metadata space) for replacing the new duplicate block with a reference to the original. If there is no match in the global hash table (or if no duplicate is found), the system needs one IO to enter the new block into the repository and another IO to update the global hash table to enter in the new hash value. Thus, for large datasets (where the metadata and global hash table do not fit in memory, and hence need a storage IO to access them) such systems could need an average of three IOs per input chunk. Further improvements are possible by employing a variety of filters so that misses in the global hash table can often be detected without requiring the first storage IO to access the global hash table, thus reducing the number of IOs needed to process some of the chunks down to two.
A storage system that offers 250,000 random IO accesses/sec (which means bandwidth of 1 GB/sec of random accesses to pages of 4 KB) could ingest and deduplicate about 83,333 (250,000 divided by 3 IOs per input chunk) input chunks of average size 4 KB per second, enabling an ingest rate of 333 MB/sec while using up all the bandwidth of the storage system. If only half of the bandwidth of the storage system is used (so that the other half is available for accesses to the stored data), such a deduplication system could still deliver ingest rates of 166 MB/sec. These ingest rates (which are limited by I/O bandwidth) are achievable provided that sufficient processing power is available in the system. Thus, given sufficient processing power, data deduplication systems are able to find large duplicates of data across the global scope of the data with an economy of IOs and deliver data reduction at ingest rates in the hundreds of megabytes per second on contemporary storage systems.
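The IO-bound ingest rates in this discussion all come from the same relationship: available random IOs per second, divided by the IOs needed per input unit, multiplied by the unit size. The sketch below reproduces the figures quoted above, as well as the earlier example of a 4 KB page assembled from 100 scattered strings. It takes 4 KB as 4,000 bytes, an assumption that matches the text's equivalence of 250,000 IOs/sec with 1 GB/sec; the helper name is hypothetical.

```python
def io_bound_ingest_mb_per_s(iops: float, ios_per_unit: float, unit_bytes: int,
                             share: float = 1.0) -> float:
    """Ingest rate in MB/sec when storage IOs, not compute, are the bottleneck.

    `share` is the fraction of the storage system's IO capacity devoted to ingest.
    """
    return iops * share / ios_per_unit * unit_bytes / 1e6


IOPS = 250_000
UNIT = 4_000  # bytes per page or chunk (4 KB, decimal)

large_window_page = io_bound_ingest_mb_per_s(IOPS, 100, UNIT)       # 100 IOs per page
dedup_full = io_bound_ingest_mb_per_s(IOPS, 3, UNIT)                # 3 IOs per chunk
dedup_half = io_bound_ingest_mb_per_s(IOPS, 3, UNIT, share=0.5)     # half the bandwidth

print(large_window_page, dedup_full, dedup_half)
```

The comparison makes the economy of IOs concrete: at three IOs per chunk, deduplication ingests roughly 33 times faster than fine-grained matching that needs 100 IOs per page on the same storage system.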
Based on the above description, it should be clear that, while these deduplication methods are effective at finding duplicates of long strings across a global scope, they are effective mainly at finding large duplicates. If there are variations or modifications to the data at a finer grain, the available redundancy will not be found using this method. This greatly reduces the breadth of datasets across which these methods are useful. These methods have found use in certain data storage systems and applications, e.g., regular backup of data, where the new data being backed up has only a few files modified and the rest are all duplicates of the files that were saved in the previous backup. Likewise, data deduplication based systems are often deployed in environments where multiple exact copies of the data or code are made, such as in virtualized environments in datacenters. However, as data evolves and is modified more generally or at a finer grain, data deduplication based techniques lose their effectiveness.
Some approaches (usually employed in data backup applications) do not perform the actual byte-by-byte comparison between the input data and the string whose hash value matches that of the input. Such solutions rely on the low probability of a collision using strong hash functions like the SHA-1. However, due to the finite non-zero probability of a collision (where multiple different strings could map to the same hash value), such methods cannot be considered to provide lossless data reduction, and would not, therefore, meet the high data-integrity requirements of primary storage and communication.
Some approaches combine multiple existing data compression techniques. Typically, in such a setup, the global data deduplication methods are applied to the data first. Subsequently, on the deduplicated dataset, and employing a small window, the Lempel-Ziv string compression methods combined with Huffman re-encoding are applied to achieve further data reduction.
However, in spite of employing all hitherto-known techniques, there continues to be a gap of several orders of magnitude between the needs of the growing and accumulating data and what the world economy can affordably accommodate using the best available modern storage systems. Given the extraordinary requirements of storage capacity demanded by the growing data, there continues to be a need for improved ways to further reduce the footprint of the data. There continues to be a need to develop methods that address the limitations of existing techniques, or that exploit available redundancy in the data along dimensions that have not been addressed by existing techniques. At the same time, it continues to be important to be able to efficiently access and retrieve the data at an acceptable speed and at an acceptable cost of processing. There also continues to be a need to be able to efficiently perform search operations directly on the reduced data.
In summary, there continues to be a long-felt need for lossless data reduction solutions that can exploit redundancy across large and extremely large datasets and provide high rates of data ingestion, data search, and data retrieval.
Embodiments described herein feature techniques and systems that can perform lossless data reduction on large and extremely large datasets while providing high rates of data ingestion and data retrieval, and that do not suffer from the drawbacks and limitations of existing data compression systems.
The following description is presented to enable any person skilled in the art to make and use the invention, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present invention. Thus, the present invention is not limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein. In this disclosure, when a phrase uses the term “and/or” with a set of entities, the phrase covers all possible combinations of the set of entities unless specified otherwise. For example, the phrase “X, Y, and/or Z” covers the following seven combinations: “X only,” “Y only,” “Z only,” “X and Y, but not Z,” “X and Z, but not Y,” “Y and Z, but not X,” and “X, Y, and Z.”
Efficient Lossless Reduction of Data Using a Prime Data Sieve
In some embodiments described herein, data is organized and stored to efficiently uncover and exploit redundancy globally across the entire dataset. An input data stream is broken up into constituent pieces or chunks called elements, and redundancy among the elements is detected and exploited at a grain finer than the element itself, thus reducing the overall footprint of stored data. A set of elements called Prime Data Elements are identified and used as common and shared building blocks for the dataset, and stored in a structure referred to as the Prime Data Store or Prime Data Sieve. A Prime Data Element is simply a sequence of bits, bytes, or digits of a certain size. Prime Data Elements can be either fixed-sized or variable-sized, depending upon the implementation. Other constituent elements of the input data are derived from Prime Data Elements and are referred to as Derivative Elements. Thus, input data is factorized into Prime Data Elements and Derivative Elements.
The Prime Data Sieve orders and organizes the Prime Data Elements so that the Prime Data Sieve can be searched and accessed in a content-associative manner. Given some input content, with some restrictions, the Prime Data Sieve can be queried to retrieve Prime Data Elements containing that content. Given an input element, the Prime Data Sieve can be searched, using the value of the element, or the values of certain fields in the element, to quickly provide either one or a small set of Prime Data Elements from which the input element can be derived with minimal storage required to specify the derivation. In some embodiments, the elements in the Prime Data Sieve are organized in tree form. A Derivative Element is derived from a Prime Data Element by performing transformations on it, such transformations being specified in a Reconstitution Program, which describes how to generate the Derivative Element from one or more Prime Data Elements. A Distance Threshold specifies a limit on the size of the stored footprint of a Derivative Element. This threshold effectively specifies the maximum allowable distance of Derivative Elements from Prime Data Elements, and also places a limit on the size of the Reconstitution Program that can be used to generate a Derivative Element.
Retrieval of derivative data is accomplished by executing the Reconstitution Program on the one or more Prime Data Elements specified by the derivation.
In this disclosure, the above-described universal lossless data reduction technique may be referred to as a Data Distillation™ process. It performs a function similar to distillation in chemistry—separating a mixture into its constituent elements. The Prime Data Sieve is also referred to as the Sieve or the Data Distillation™ Sieve or the Prime Data Store.
In this scheme, the input data stream is factorized into a sequence of elements, each element being either a Prime Data Element or a Derivative Element which derives from one or more Prime Data Elements. Each element is transformed into a losslessly reduced representation which, in the case of a Prime Data Element, includes a reference to the Prime Data Element, and in the case of a Derivative Element, includes references to the one or more Prime Data Elements involved in the derivation, and a description of the Reconstitution Program. Thus the input data stream is factorized into a sequence of elements that are in the losslessly reduced representation. This sequence of elements (appearing in the losslessly reduced representation) is referred to as a distilled data stream or distilled data. The sequence of elements in the distilled data has a one-to-one correspondence to the sequence of elements in the input data, i.e., the nth element in the sequence of elements in the distilled data corresponds to the nth element in the sequence of elements in the input data.
The universal lossless data reduction technique described in this disclosure receives an input data stream and converts it into the combination of a distilled data stream and a Prime Data Sieve, such that the sum of the footprints of the distilled data stream and the Prime Data Sieve is usually smaller than the footprint of the input data stream. In this disclosure, the distilled data stream and the Prime Data Sieve are collectively called the losslessly reduced data, and will also be referred to interchangeably as the “reduced data stream” or “reduced data” or “Reduced Data”. Likewise, for the sequence of elements that is produced by the lossless data reduction techniques described in this disclosure, and that appears in the losslessly reduced format, the following terms are used interchangeably: “reduced output data stream,” “reduced output data”, “distilled data stream,” “distilled data”, and “Distilled Data.”
A sequence of bytes is received from an input data stream and presented as Input Data 102 to Data Reduction Apparatus 103, also referred to as the Data Distillation™ Apparatus. Parser & Factorizer 104 parses the incoming data and breaks it into chunks or candidate elements. The Factorizer decides where in the input stream to insert breaks to slice up the stream into candidate elements. Once two consecutive breaks in the data have been identified, a Candidate Element 105 is created by the Parser and Factorizer and presented to Prime Data Sieve 106, also referred to as the Data Distillation™ Sieve.
Data Distillation™ Sieve or Prime Data Sieve 106 contains all the Prime Data Elements (labeled as PDEs in
The Sieve or Prime Data Sieve 106 can be initialized with a set of Prime Data Elements whose values are spread across the data space. Alternatively, the Sieve can start out empty, and Prime Data Elements can be added to it dynamically as data is ingested, in accordance with the Data Distillation™ process described herein in reference to
Deriver 110 receives the Candidate Element 105 and the retrieved Prime Data Elements suitable for derivation 107 (which are content associatively retrieved from the Prime Data Sieve 106), determines whether or not the Candidate Element 105 can be derived from one or more of these Prime Data Elements, generates Reduced Data Components 115 (comprised of references to the relevant Prime Data Elements and the Reconstitution Program), and provides updates 114 to the Prime Data Sieve. If the candidate element is a duplicate of a retrieved Prime Data Element, the Deriver places into the Distilled Data 108 a reference (or pointer) to the Prime Data Element located in the Prime Data Sieve, and also an indicator that this is a Prime Data Element. If no duplicate is found, the Deriver expresses the candidate element as the result of one or more transformations performed on one or more retrieved Prime Data Elements, where the sequence of transformations is collectively referred to as the Reconstitution Program, e.g., Reconstitution Program 119A. Each derivation may require its own unique program to be constructed by the Deriver. The Reconstitution Program specifies transformations such as insertions, deletions, replacements, concatenations, arithmetic, and logical operations that can be applied to the Prime Data Elements. Provided the footprint of the Derivative Element (calculated as the size of the Reconstitution Program plus the size of the references to the required Prime Data Elements) is within a certain specified Distance Threshold with respect to the candidate element (to enable data reduction), the candidate element is reformulated as a Derivative Element and replaced by the combination of the Reconstitution Program and references to the relevant Prime Data Element (or elements)—these form the Reduced Data Components 115 in this case. 
If the threshold is exceeded, or if no suitable Prime Data Element was retrieved from the Prime Data Sieve, the Prime Data Sieve may be instructed to install the candidate as a fresh Prime Data Element. In this case, the Deriver places into the distilled data a reference to the newly added Prime Data Element, and also an indicator that this is a Prime Data Element.
A request for Retrieval of data (e.g., Retrieval Requests 109) can be in the form of either a reference to a location in the Prime Data Sieve containing a Prime Data Element, or in the case of a derivative, a combination of such a reference to a Prime Data Element and an associated Reconstitution Program (or in the case of a derivative based on multiple Prime Data Elements, a combination of the references to multiple Prime Data Elements and an associated Reconstitution Program). Using the one or more references to Prime Data Elements in the Prime Data Sieve, Retriever 111 can access the Prime Data Sieve to fetch the one or more Prime Data Elements and provide the one or more Prime Data Elements as well as the Reconstitution Program to Reconstitutor 112, which executes the transformations (specified in the Reconstitution Program) on the one or more Prime Data Elements to generate the Reconstituted Data 116 (which is the data that was requested) and deliver it to the Retrieved Data Output 113 in response to the data retrieval request.
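The retrieval path described above can be illustrated with a minimal sketch. The opcode names (`insert`, `delete`, `replace`), the tuple encoding of each transformation, the choice to concatenate multiple Prime Data Elements before applying the program, and the function name `reconstitute` are all assumptions made for illustration; the disclosure only states that a Reconstitution Program specifies transformations such as insertions, deletions, and replacements.

```python
from typing import List, Tuple

# Hypothetical encoding of one transformation: (opcode, offset, payload).
# For "delete", payload is the number of bytes to remove.
Op = Tuple[str, int, object]

def reconstitute(primes: List[bytes], program: List[Op]) -> bytes:
    """Apply a Reconstitution Program to the fetched Prime Data Elements."""
    # Assumption: multiple Prime Data Elements are concatenated first.
    data = bytearray(b"".join(primes))
    for opcode, offset, payload in program:
        if opcode == "insert":
            data[offset:offset] = payload
        elif opcode == "delete":
            del data[offset:offset + payload]
        elif opcode == "replace":
            data[offset:offset + len(payload)] = payload
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return bytes(data)

# A retrieval request for a derivative: one reference plus a program.
prime = b"the quick brown fox"
program = [("replace", 4, b"QUICK"), ("insert", 19, b" jumps")]
derived = reconstitute([prime], program)
```

A request for a Prime Data Element itself corresponds to an empty program, in which case the fetched element is returned unchanged.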
In a variation of this embodiment, the Prime Data Elements may be stored in the Sieve in compressed form (using techniques known in the prior art, including Huffman Coding and Lempel Ziv methods) and decompressed when needed. This has the advantage of reducing the overall footprint of the Prime Data Sieve. The only constraint is that Content Associative Mapper 121 must continue to provide Content Associative Access to the Prime Data Elements as before.
In
In a variation of this embodiment, the Prime Data Elements may be stored in the Sieve in compressed form (using techniques known in the prior art, including Huffman Coding and Lempel Ziv methods) and decompressed when needed. Likewise, Prime Reconstitution Programs may be stored in the Prime Reconstitution Program Sieve in compressed form (using techniques known in the prior art, including Huffman Coding and Lempel Ziv methods) and decompressed when needed. This has the advantage of reducing the overall footprint of the Prime Data Sieve and Prime Reconstitution Program Sieve. The only constraint is that Content Associative Mappers 121 and 122 must continue to provide Content Associative Access to the Prime Data Elements and Prime Reconstitution Programs as before.
The foregoing descriptions of methods and apparatuses for data reduction that factorize input data into elements and derive these from Prime Data Elements resident in a Prime Data Sieve have been presented only for purposes of illustration and description. They are not intended to be exhaustive or to limit the present invention to the forms disclosed. Accordingly, many modifications and variations will be apparent to practitioners skilled in the art.
In variations of the embodiments shown in
A first check is performed to see if the candidate is a duplicate of any of these elements (operation 210). This check can be sped up using any suitable hashing technique. If the candidate is identical to a Prime Data Element retrieved from the Prime Data Sieve (“Yes” branch of operation 210), the entry in the distilled data created for the candidate element is replaced by a reference to this Prime Data Element and an indication that this entry is a Prime Data Element (operation 220). If no duplicate is found (“No” branch of operation 210), the entries retrieved from the Prime Data Sieve based on the candidate element are regarded as entries from which the candidate element is potentially derivable. The following is an important, novel, and non-obvious feature of the Prime Data Sieve: when a duplicate is not found in the Prime Data Sieve, the Prime Data Sieve can return Prime Data Elements that, although not identical to the candidate element, are elements from which the candidate element may potentially be derived by applying one or more transformations to the Prime Data Element(s). The process can then perform analysis and computation to derive the candidate element from either the most suitable Prime Data Element or a set of suitable Prime Data Elements (operation 212). In some embodiments, the derivation expresses the candidate element as the result of transformations performed on the one or more Prime Data Elements, such transformations being collectively referred to as the Reconstitution Program. Each derivation may require its own unique program to be constructed. In addition to constructing the Reconstitution Program, the process can also compute a distance metric that generally indicates a level of storage resources and/or computational resources that are required to store the reformulation of the candidate element and to reconstitute the candidate element from the reformulation. 
In some embodiments, the footprint of the Derivative Element is used as a measure of the distance of the candidate from the Prime Data Element(s)—specifically, a Distance metric can be defined as the sum of the size of the Reconstitution Program plus the size of the references to the one or more Prime Data Elements involved in the derivation. The derivation with the shortest Distance can be chosen. The Distance for this derivation is compared with a Distance Threshold (operation 214), and if the Distance does not exceed the Distance Threshold, the derivation is accepted (“Yes” branch of operation 214). In order to yield data reduction, the Distance Threshold must always be less than the size of the candidate element. For example, the Distance Threshold may be set to 50% of the size of the candidate element, so that a derivative will only be accepted if its footprint is less than or equal to half the footprint of the candidate element, thereby ensuring a reduction of 2× or greater for each candidate element for which a suitable derivation exists. The Distance Threshold can be a predetermined percentage or fraction, either based on user-specified input or chosen by the system. The Distance Threshold may be determined by the system based on static or dynamic parameters of the system. Once the derivation is accepted, the candidate element is reformulated and replaced by the combination of the Reconstitution Program and references to the one or more Prime Data Elements. The entry in the distilled data created for the candidate element is replaced by the derivation, i.e., it is replaced by an indication that this is a derivative element, along with the Reconstitution Program plus references to the one or more Prime Data Elements involved in the derivation (operation 218). On the other hand, if the Distance for the best derivation exceeds the Distance Threshold (“No” branch of operation 214), none of the possible derivatives will be accepted. 
In that case, the candidate element may be allocated and entered into the Sieve as a new Prime Data Element, and the entry in the distilled data created for the candidate element will be a reference to the newly created Prime Data Element along with an indication that this is a Prime Data Element (operation 216).
Finally, the process can check if there are any additional candidate elements (operation 222), and return to operation 204 if there are more candidate elements (“Yes” branch of operation 222), or terminate the process if there are no more candidate elements (“No” branch of operation 222).
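The flow of operations 210 through 222 can be sketched as follows. This is a simplified illustration under stated assumptions, not the disclosed implementation: the Sieve is a plain list searched exhaustively rather than content-associatively, the only derivation attempted keeps a shared prefix and suffix of one Prime Data Element and stores the differing middle bytes, and the sizes `REF_SIZE` and `OP_HEADER_SIZE` are hypothetical.

```python
from typing import List, Tuple

REF_SIZE = 8         # assumed size in bytes of a reference into the Sieve
OP_HEADER_SIZE = 8   # assumed cost of encoding (prefix_len, suffix_len)

def derive(candidate: bytes, prime: bytes) -> Tuple[int, int, bytes]:
    """Express candidate as prime[:p] + middle + the last s bytes of prime."""
    limit = min(len(candidate), len(prime))
    p = 0
    while p < limit and candidate[p] == prime[p]:
        p += 1
    s = 0
    while s < limit - p and candidate[-1 - s] == prime[-1 - s]:
        s += 1
    return p, s, candidate[p:len(candidate) - s]

def distill(candidates: List[bytes], threshold_fraction: float = 0.5):
    sieve: List[bytes] = []   # stand-in for the Prime Data Sieve
    distilled = []            # one entry per candidate, in input order
    for cand in candidates:
        if cand in sieve:                                   # operation 210
            distilled.append(("prime", sieve.index(cand)))  # operation 220
            continue
        best = None                                         # operation 212
        for ref, prime in enumerate(sieve):
            p, s, middle = derive(cand, prime)
            distance = REF_SIZE + OP_HEADER_SIZE + len(middle)
            if best is None or distance < best[0]:
                best = (distance, ref, (p, s, middle))
        if best is not None and best[0] <= threshold_fraction * len(cand):
            distilled.append(("derivative", best[1], best[2]))  # operation 218
        else:
            sieve.append(cand)                              # operation 216
            distilled.append(("prime", len(sieve) - 1))
    return sieve, distilled

def rebuild(entry, sieve: List[bytes]) -> bytes:
    """Reconstitute one distilled entry (used here to check losslessness)."""
    if entry[0] == "prime":
        return sieve[entry[1]]
    _, ref, (p, s, middle) = entry
    prime = sieve[ref]
    return prime[:p] + middle + prime[len(prime) - s:]

inputs = [b"A" * 100, b"A" * 100, b"A" * 50 + b"B" + b"A" * 49, bytes(range(100))]
sieve, distilled = distill(inputs)
```

With the default threshold of 50%, the third input is accepted as a derivative because its Distance (17 bytes under the assumed sizes) is well below half of its 100-byte size, while the fourth input has nothing in common with the installed element and is installed as a new Prime Data Element.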
A variety of methods can be employed to perform operation 202 in
One important function of the Prime Data Sieve is to provide content-associative lookup based upon a candidate element presented to it, and to quickly provide one or a small set of Prime Data Elements from which a candidate element can be derived with minimal storage needed to specify the derivation. This is a difficult problem given a large dataset. Given terabytes of data, even with kilobyte-sized elements, there are billions of elements to search and choose from. The problem is even more severe on larger datasets. It becomes important to organize and order the elements using a suitable technique and then detect similarities and derivability within that organization of the elements, to be able to quickly provide a small set of suitable Prime Data Elements.
The entries in the Sieve could be ordered based upon the value of each element (i.e., Prime Data Element), so that all entries could be arranged by value in ascending or descending order. Alternatively, they could be ordered along a principal axis that is based upon the value of certain fields in the element, followed by subordinate axes that use the rest of the content of the element. In this context, a field is a set of contiguous bytes from the content of the element. Fields could be located by applying a method of fingerprinting to the contents of the element so that the location of a fingerprint identifies the location of a field. Alternatively, certain fixed offsets inside the content of the element could be chosen to locate a field. Other methods could also be employed to locate a field, including, but not limited to, parsing the element to detect certain declared structure, and locating fields within that structure.
In yet another form of organization, certain fields or combinations of fields within the element could be considered as dimensions, so that a concatenation of these dimensions followed by the rest of the content of each element could be used to order and organize the data elements. In general, the correspondence or mapping between fields and dimensions can be arbitrarily complex. For example, in some embodiments exactly one field may map to exactly one dimension. In other embodiments, a combination of multiple fields, e.g., F1, F2, and F3, may map to a dimension. The combining of fields may be achieved either by concatenating the fields or by applying any other suitable function to them. The important requirement is that the arrangement of fields, dimensions, and the rest of the content of an element that is used to organize elements must enable all Prime Data Elements to be uniquely identified by their content and ordered in the Sieve.
In yet another embodiment, a certain suitable function (such as an algebraic or arithmetic transformation) can be applied to the element, where the function has the property that the result of the function uniquely identifies each element. In one such embodiment, each element is divided by a prime polynomial or some chosen number or value and the result of the division (which comprises the quotient and remainder pair) is taken as the function used to organize and order the element in the prime data sieve. For example, the bits comprising the remainder can form the leading bytes of the result of the function followed by the bits comprising the quotient. Or alternatively, the bits comprising the quotient can be used to form the leading bytes of the result of the function followed by the bits comprising the remainder. For a given divisor used to divide the input element, the quotient and remainder pair will uniquely identify the element, and hence this pair can be used to form the result of the function that is used to organize and order the element in the prime data sieve. By applying this function to each element, prime data elements can be organized in the sieve based on the result of the function. The function will still uniquely identify each prime data element and will provide an alternate method to sort and organize the prime data elements in the prime data sieve.
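A minimal sketch of the quotient and remainder formulation follows. It covers only the "chosen number" variant, treating the element as a big-endian integer and using ordinary integer division; division by a prime polynomial is not shown. The function names `ordering_key` and `recover`, the fixed-width encoding of each half, and the divisor 251 are assumptions for illustration.

```python
def ordering_key(element: bytes, divisor: int, remainder_first: bool = True) -> bytes:
    """Map an element to a (remainder, quotient) key that still identifies it."""
    value = int.from_bytes(element, "big")
    quotient, remainder = divmod(value, divisor)
    # Assumption: each half is padded to the element's width so the
    # concatenation can be split again without ambiguity.
    width = len(element)
    q = quotient.to_bytes(width, "big")
    r = remainder.to_bytes(width, "big")
    return r + q if remainder_first else q + r

def recover(key: bytes, divisor: int, remainder_first: bool = True) -> bytes:
    """Invert ordering_key, showing that the pair uniquely identifies the element."""
    width = len(key) // 2
    first, second = key[:width], key[width:]
    r, q = (first, second) if remainder_first else (second, first)
    value = int.from_bytes(q, "big") * divisor + int.from_bytes(r, "big")
    return value.to_bytes(width, "big")

elements = [b"\x00\x01", b"\x01\x00", b"\x01\x01", b"\xff\xfe"]
keys = [ordering_key(e, 251) for e in elements]
ordered = sorted(elements, key=lambda e: ordering_key(e, 251))
```

Because the element can be recovered from the key for a given divisor, sorting by the key yields a valid alternate ordering in which no two distinct elements collide.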
In yet another embodiment, a certain suitable function (such as an algebraic or arithmetic transformation) can be applied to each field of the element, where the function has the property that the result of the function uniquely identifies that field. For example, a function such as division by a suitable polynomial or number or value may be executed on successive fields or successive portions of the content of each element, so that the concatenation of the results of the successive functions may be used to order and organize the element in the prime data sieve. Note that, on each field, a different polynomial may be used for division. Each function will furnish a suitably ordered concatenation of the bits from the quotient and remainder emitted by the division operation for that portion or field. Each prime data element can be ordered and organized in the sieve by using this concatenation of functions that are applied to the fields of the element. The concatenation of functions will still uniquely identify each prime data element and will provide an alternate method to sort and organize the prime data elements in the prime data sieve.
In some embodiments, the contents of an element can be represented as an expression as follows: Element=Head.*sig1.*sig2.* . . . sigI.* . . . sigN.*Tail, where “Head” is a sequence of bytes comprising the leading bytes of the element, “Tail” is a sequence of bytes comprising the concluding bytes of the element, and “sig1”, “sig2”, “sigI”, and “sigN” are various signatures or patterns or regular expressions or sequences of bytes of certain lengths within the body of the content of the element that characterize the element. The expression “.*” between the various signatures is the wildcard expression, i.e., it is the regular expression notation that allows any number of intervening bytes of any value other than the signature that follows the expression “.*”. In some embodiments, the N-tuple (sig1, sig2, . . . sigI, . . . sigN) is referred to as the Skeletal Data Structure or the Skeleton of the element, and can be regarded as a reduced and essential subset or essence of the element. In other embodiments, the (N+2)-tuple (Head, sig1, sig2, . . . sigI, . . . sigN, Tail) is referred to as the Skeletal Data Structure or the Skeleton of the element. Alternatively, an (N+1)-tuple may be employed that includes either the Head or the Tail along with the rest of the signatures.
A method of fingerprinting can be applied to the content of the element to determine the locations of the various components (or signatures) of the Skeletal Data Structure within the content of the element. Alternatively, certain fixed offsets inside the content of the element could be chosen to locate a component. Other methods could also be employed to locate a component of the Skeletal Data Structure, including, but not limited to, parsing the element to detect certain declared structure, and locating components within that structure. Prime Data Elements can be ordered in the Sieve based on their Skeletal Data Structure. In other words, the various components of the Skeletal Data Structure of the element can be considered as Dimensions, so that a concatenation of these dimensions followed by the rest of the content of each element could be used to order and organize the Prime Data Elements in the Sieve.
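As one illustration of locating the components of a Skeletal Data Structure, the sketch below matches a list of known signatures, in order, against the content of an element using a bytes regular expression. It assumes the signatures are literal byte sequences supplied by the caller and leaves Head and Tail implicit; the function name `locate_skeleton` and the sample element are hypothetical. Fingerprinting and structure parsing, which the text also mentions, are not shown.

```python
import re
from typing import List, Optional, Tuple

def locate_skeleton(element: bytes, signatures: List[bytes]) -> Optional[List[Tuple[int, int]]]:
    """Return the (start, end) offsets of each signature, matched in order, or None."""
    # Builds sig1.*sig2.* ... sigN as a bytes pattern. Non-greedy wildcards
    # pick the earliest occurrence of each following signature.
    pattern = b".*?".join(b"(" + re.escape(s) + b")" for s in signatures)
    match = re.search(pattern, element, flags=re.DOTALL)
    if match is None:
        return None
    return [match.span(i + 1) for i in range(len(signatures))]

element = b"HDR..GET /index..Host: a..END"
spans = locate_skeleton(element, [b"GET", b"Host:"])
# The N-tuple of located components, usable as Dimensions for ordering.
skeleton = tuple(element[a:b] for a, b in spans)
```

The returned offsets identify where each Dimension sits inside the element, so the same offsets can be used both to extract the Dimensions and to determine the rest of the content that follows them in the ordering.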
Some embodiments factorize the input data into candidate elements, where the size of each candidate element is substantially larger than the size of a reference needed to access all such elements in the global dataset. One observation about data that is broken into such data chunks (and that is being accessed in a content-associative fashion) is that the actual data is very sparse with respect to the total possible values that the data chunk can specify. For example, consider a 1 zettabyte dataset. One needs about 70 bits to address every byte in the dataset. At a chunk size of 128 bytes (1024 bits), there are approximately 2^63 chunks in the 1 zettabyte dataset, so that one needs 63 bits (fewer than 8 bytes) to address all of the chunks. Note that an element or chunk of 1024 bits could have one of 2^1024 possible values, while the number of actual values of the given chunks in the dataset is at most 2^63 (if all the chunks are distinct). This indicates that the actual data is extremely sparse with respect to the number of values that can be reached or named by the content of an element. This enables use of a tree structure, which is well-suited for organizing very sparse data in a manner that enables efficient content-based lookups, allows new elements to be efficiently added to the tree structure, and is cost-effective in terms of the incremental storage needed for the tree structure itself. Although there are only 2^63 distinct chunks in the 1 zettabyte dataset, thus requiring only 63 differentiating bits of information to tell them apart, the relevant differentiating bits might be spread across the entire 1024 bits of the element and occur at different locations for each element. 
Therefore, to fully differentiate all the elements, it is insufficient to examine only a fixed 63 bits from the content, but rather the entire content of the element needs to participate in the sorting of the elements, especially in a solution that provides true content-associative access to any and every element in the dataset. In the Data Distillation™ framework, it is desirable to be able to detect derivability within the framework used to order and organize the data. Keeping all of the above in mind, a tree structure based upon the content (which progressively differentiates the data as more of the content is examined) is a suitable organization to order and differentiate all the elements in the factorized dataset. Such a structure provides numerous intermediate levels of subtrees which can be treated as groupings of derivable elements or groupings of elements with similar properties of derivability. Such a structure can be hierarchically augmented with metadata characterizing each subtree or with metadata characterizing each element of data. Such a structure can effectively communicate the composition of the entire data it contains, including the density, proximity, and distribution of actual values in the data.
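The figures in the sparsity argument above can be checked with a few lines of integer arithmetic. The sketch assumes that 1 zettabyte is taken as 2^70 bytes, which is what the 70-bit byte address in the text implies; the variable names are illustrative.

```python
# Assumption: 1 zettabyte is taken as 2**70 bytes, as the text's figures imply.
DATASET_BYTES = 2 ** 70
CHUNK_BYTES = 128

byte_address_bits = DATASET_BYTES.bit_length() - 1   # bits to address each byte
num_chunks = DATASET_BYTES // CHUNK_BYTES            # chunks in the dataset
chunk_address_bits = num_chunks.bit_length() - 1     # bits to address each chunk
value_bits = CHUNK_BYTES * 8                         # bits of content per chunk

# Largest fraction of the nameable values that can actually occur,
# expressed as a power of two (an exponent of -961 means 2**-961).
occupancy_exponent = chunk_address_bits - value_bits

print(byte_address_bits, chunk_address_bits, value_bits, occupancy_exponent)
```

Even if every chunk is distinct, the dataset occupies at most 2^63 of the 2^1024 values a chunk can name, which is the sense in which the data is extremely sparse.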
Some embodiments organize the Prime Data Elements in the Sieve in tree form. Each Prime Data Element has a distinct “Name” which is constructed from the entire content of the Prime Data Element. This Name is designed to be sufficient to uniquely identify the Prime Data Element and to differentiate it with respect to all other elements in the tree. There are several ways in which the Name can be constructed from the content of the Prime Data Element. The Name may be simply comprised of all the bytes of the Prime Data Element, with these bytes appearing in the Name in the same order as they exist in the Prime Data Element. In another embodiment, certain fields or combinations of fields referred to as Dimensions (where fields and dimensions are as described earlier) are used to form the leading bytes of the Name, with the rest of the content of the Prime Data Element forming the rest of the Name, so that the entire content of the Prime Data Element is participating to create the complete and unique Name of the element. In yet another embodiment, the fields of the Skeletal Data Structure of the element are chosen as Dimensions (where fields and dimensions are as described earlier), and are used to form the leading bytes of the Name, with the rest of the content of the Prime Data Element forming the rest of the Name, so that the entire content of the Prime Data Element is participating to create the complete and unique Name of the element.
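One way to picture the Dimension-based construction of a Name is the sketch below, which places the bytes of selected fields first and appends the remaining bytes of the element in their original order. It assumes fields are given as non-overlapping (offset, length) pairs at fixed offsets that lie within the element; the function name `make_name` and the sample field layout are hypothetical.

```python
from typing import List, Tuple

def make_name(element: bytes, fields: List[Tuple[int, int]]) -> bytes:
    """Build a Name: Dimension bytes first, then the rest of the content."""
    # Leading bytes of the Name: the fields, in the order they are listed.
    dimensions = b"".join(element[off:off + length] for off, length in fields)
    # Remaining bytes: everything not covered by a field, in original order.
    used = set()
    for off, length in fields:
        used.update(range(off, off + length))
    rest = bytes(b for i, b in enumerate(element) if i not in used)
    return dimensions + rest

name = make_name(b"abcdefgh", [(4, 2), (0, 1)])
```

Because every byte of the element appears exactly once in the Name, the entire content participates in the Name, and for a fixed field layout two distinct elements of the same size cannot produce the same Name.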
In some embodiments, the Name for the element can be computed by performing algebraic or arithmetic transformations on the element while retaining the property that each name uniquely identifies each element. In one such embodiment, each element is divided by a prime polynomial or some chosen number or value and the result of the division (which is the quotient and remainder pair) is used to form the Name of the element. For example, the bits comprising the remainder can form the leading bytes of the Name followed by the bits comprising the quotient. Or alternatively, the bits comprising the quotient can be used to form the leading bytes of the Name followed by the bits comprising the remainder. For a given divisor used to divide the input element, the quotient and remainder pair will uniquely identify the element, and hence this pair can be used to form the Name of each element. Using this formulation of the Name, prime data elements can be organized in the sieve based on their Names. The Name will still uniquely identify each prime data element and will provide an alternate method to sort and organize the prime data elements in the prime data sieve.
In another embodiment, a variation of this method of generating a Name (which involves division and extracting quotient/remainder pairs) may be employed, where division by a suitable polynomial or number or value may be executed on successive fields or successive portions of the content of each element, yielding successive portions of the Name for each element (each portion being a suitably ordered concatenation of the bits from the quotient and remainder emitted by the division operation for that portion or field). Note that, on each field, a different polynomial may be used for division. Using this formulation of the Name, prime data elements can be organized in the sieve based on their Names. The Name will still uniquely identify each prime data element and will provide an alternate method to sort and organize the prime data elements in the prime data sieve.
The Name of each Prime Data Element is used to order and organize the Prime Data Elements in the tree. For most practical datasets, even those that are very large in size (such as a 1 zettabyte dataset, comprised of 2^58 elements of, say, 4 KB size), it is expected that a small subset of the bytes of the Name will often serve to sort and order the majority of the Prime Data Elements in the tree.
At the second level of the trie, the second most significant byte of the Name of each Prime Data Element is used to further divide each group of the Prime Data Elements into smaller subgroups. For example, in
The process of subdivision continues at each level of the trie creating links from a parent node to each child node, where a child node represents a subset of the Prime Data Elements represented by the parent node. This process continues until there are only individual Prime Data Elements at the leaves of the trie. A leaf node represents a group of leaves. In
This example illustrates how, in the given mix of Prime Data Elements, only a subset of the bytes of the Name serves to identify Prime Data Elements in the tree, and the entire Name is not needed to arrive at a unique Prime Data Element. Also, Prime Data Elements or groups of Prime Data Elements might each require a different number of significant bytes to be able to uniquely identify them. Thus, the depth of the trie from the root node to a Prime Data Element could vary from one Prime Data Element to another. Furthermore, in the trie, each node might have a different number of links descending to subtrees below.
In such a trie, each node has a name comprised of the sequence of bytes that specifies how to reach this node. For example, the name for Node 304 is “21”. Also, the subset of bytes from the Name of the element that uniquely identifies the element in the current distribution of elements in the tree is the “Path” to this Prime Data Element from the root node. For example, in
The trie structure described here may create deep trees (i.e., trees that have many levels) since every differentiating byte of the Name of an element in the tree adds one level of depth to the trie.
Note that the tree data structures in
We now describe a method for content-associative lookup of the trie structure, given an input candidate element. This method involves navigation of the trie structure using the Name of the candidate element, followed by subsequent analysis and screening to decide what to return as the result of the overall content-associative lookup. In other words, the trie navigation process returns a first outcome, and then analysis and screening is performed on that outcome to determine the result of the overall content-associative lookup.
To begin the trie navigation process, the value of the most significant byte from the Name of the candidate element will be used to select a link (denoted by that value) from the root node to a subsequent node representing a subtree of Prime Data Elements with that same value in the most significant byte of their Names. Proceeding from this node, the second byte from the Name of the candidate element is examined and the link denoted by that value is selected, thus advancing one level deeper (or lower) into the trie and selecting a smaller subgroup of Prime Data Elements that now share with the candidate element at least two significant bytes from their Names. This process continues until a single Prime Data Element is reached or until none of the links match the value of the corresponding byte from the Name of the candidate element. Under either of these conditions, the tree navigation process terminates. If a single Prime Data Element is reached, it may be returned as the outcome of the trie navigation process. If not, one alternative is to report a “miss”. Another alternative is to return multiple Prime Data Elements that are in the subtree that is rooted at the node where the navigation terminated.
Once the trie navigation process has terminated, other criteria and requirements may be used to analyze and screen the outcome of the trie navigation process to determine what should be returned as the result of the content-associative lookup. For example, when either a single Prime Data Element or multiple Prime Data Elements are returned by the trie navigation process, there could be an additional requirement that they share a certain minimum number of bytes with the Name of the candidate element before qualifying to be returned as the result of the content-associative lookup (otherwise the content-associative lookup returns a miss). Another example of a screening requirement could be that, if the trie navigation process terminates without reaching a single Prime Data Element so that multiple Prime Data Elements (rooted at the node where the trie navigation terminated) are returned as the outcome of the trie navigation process, then these multiple Prime Data Elements will qualify to be returned as the result of the overall content-associative lookup only if the number of these elements is fewer than a certain specified limit (otherwise the content-associative lookup returns a miss). Combinations of multiple requirements may be employed to determine the result of the content-associative lookup. In this manner, the lookup process will either report a “miss” or return a single Prime Data Element, or if not a single Prime Data Element, then a set of Prime Data Elements that are likely to be good starting points for deriving the candidate element.
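The navigation and screening steps can be sketched as follows. This is a simplified illustration under stated assumptions: Names are distinct byte strings of equal length, each level of the trie consumes exactly one byte, a leaf is stored directly as the Name of a single Prime Data Element, and the screening rule counts the bytes matched along the path as the bytes shared with the candidate. The class `Node`, the functions `insert`, `collect`, and `lookup`, and the parameters `min_shared` and `max_results` are hypothetical names.

```python
from typing import Dict, List, Union

class Node:
    """Interior trie node: one link per value of the next byte of the Name."""
    def __init__(self) -> None:
        self.links: Dict[int, Union["Node", bytes]] = {}

def insert(root: Node, name: bytes) -> None:
    node, depth = root, 0
    while True:
        key = name[depth]
        child = node.links.get(key)
        if child is None:
            node.links[key] = name          # leaf: a single Prime Data Element
            return
        if isinstance(child, bytes):
            if child == name:
                return                      # already installed
            # Push the existing leaf one level down, then keep descending.
            sub = Node()
            sub.links[child[depth + 1]] = child
            node.links[key] = sub
            child = sub
        node, depth = child, depth + 1

def collect(node: Union[Node, bytes]) -> List[bytes]:
    """All Prime Data Elements in the subtree rooted at node."""
    if isinstance(node, bytes):
        return [node]
    out: List[bytes] = []
    for key in sorted(node.links):
        out.extend(collect(node.links[key]))
    return out

def lookup(root: Node, name: bytes, min_shared: int = 1, max_results: int = 4) -> List[bytes]:
    """Content-associative lookup; an empty list stands for a miss."""
    node: Union[Node, bytes] = root
    depth = 0
    # Navigate until a single element is reached or no link matches.
    while isinstance(node, Node) and depth < len(name) and name[depth] in node.links:
        node = node.links[name[depth]]
        depth += 1
    outcome = collect(node)
    # Screening: enough matched bytes, and not too many elements.
    if depth < min_shared or len(outcome) > max_results:
        return []
    return outcome

root = Node()
for n in [b"2133", b"2147", b"3471", b"3479", b"9001"]:
    insert(root, n)
```

In this sketch, a lookup of `b"2133"` reaches a single element, a lookup of `b"2199"` stops two levels down and returns both elements in that subtree, and a lookup of `b"5000"` matches no link at the root and is reported as a miss.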
Note that, during tree navigation (using content from a given candidate element), upon arriving at any parent node in the tree, the tree navigation process needs to ensure that sufficient bytes from the Name of the candidate element are examined to unambiguously decide which link to choose. To choose a given link, the bytes from the Name of the candidate must match all the bytes that denote the transition to that particular link. Once again, in such a tree, each node of the tree has a name comprised of the sequence of bytes that specifies how to reach this node. For example, the name of node 309 can be “347” because it represents a group of Prime Data Elements (e.g., elements 311 and 312) with the 3 leading bytes of their Names being “347”. Upon a lookup of the tree using a candidate element with the leading 3 bytes of the Name being 347, this data pattern causes the tree navigation process to reach node 309 as shown in
For diverse and sparse data, the tree structure in
For example, as shown in
Upon receiving a candidate element, the process applies the same technique described above to determine the Name of the candidate element, and uses this Name to navigate the tree for a content-associative lookup. Thus, the same and consistent treatment is applied to Prime Data Elements (upon their installation into the tree) and candidate elements (upon receiving them from the Parser & Factorizer) in order to create their Names. The tree navigation process uses the Name of the candidate element to navigate the tree. In this embodiment, if no fingerprint is found in the candidate element, the tree navigation process navigates down the subtree that organizes and contains Prime Data Elements whose content did not evaluate to the fingerprint.
For each of the 4 scenarios, the Name of an element is created as follows: (1) When both fingerprints are found, $x$ bytes from the location identified by “Fingerprint 1” can be extracted to form “Dimension 1” and $y$ bytes from the location identified by “Fingerprint 2” can be extracted to form “Dimension 2” and these $x+y$ bytes can be used as the leading bytes $N_1 N_2 \ldots N_{x+y}$ of the Name of each such element in
(1) Both Fingerprint 1 and Fingerprint 2 found:
$N_1 \ldots N_x \leftarrow B_{i+1} \ldots B_{i+x}$ ($x$ bytes from Dimension 1)
$N_{x+1} \ldots N_{x+y} \leftarrow B_{j+1} \ldots B_{j+y}$ ($y$ bytes from Dimension 2)
$N_{x+y+1} \ldots N_t \leftarrow B_{i+x+1} B_{i+x+2} \ldots B_j \; B_{j+y+1} B_{j+y+2} \ldots B_t \; B_1 B_2 \ldots B_i$ (rest of the bytes from the Candidate Element of size $t$ bytes)
(2) Fingerprint 1 found, Fingerprint 2 not found:
$N_1 \ldots N_x \leftarrow B_{i+1} \ldots B_{i+x}$ ($x$ bytes from Dimension 1)
$N_{x+1} \ldots N_t \leftarrow B_{i+x+1} B_{i+x+2} \ldots B_t \; B_1 B_2 \ldots B_i$ (rest of the bytes from the Candidate Element of size $t$ bytes)
(3) Fingerprint 2 found, Fingerprint 1 not found:
$N_1 \ldots N_y \leftarrow B_{j+1} \ldots B_{j+y}$ ($y$ bytes from Dimension 2)
$N_{y+1} \ldots N_t \leftarrow B_{j+y+1} B_{j+y+2} \ldots B_t \; B_1 B_2 \ldots B_j$ (rest of the bytes from the Candidate Element of size $t$ bytes)
(4) No fingerprints found:
$N_1 \ldots N_t \leftarrow B_1 \ldots B_t$
Upon receiving a candidate element, the process applies the same technique described above to determine the Name of the candidate element. In this embodiment, the 4 methods of Name construction described above (depending upon whether fingerprint 1 and fingerprint 2 are found or not) are applied to the candidate element just as they were to Prime Data Elements when they were entered into the Sieve. Thus, the same and consistent treatment is applied to Prime Data Elements (upon their installation into the tree) and to candidate elements (upon receiving them from the Parser & Factorizer) in order to create their Names. The tree navigation process uses the Name of the candidate element to navigate the tree for a content-associative lookup.
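The four methods of Name construction can be sketched as a single function. This is an illustration under stated assumptions, not the disclosed implementation: the function name `build_name` is hypothetical, the fingerprint search itself is not shown (the offsets `i` and `j` are assumed to be supplied by it, with `None` meaning "not found"), offsets are 0-based so that Dimension 1 is `element[i:i + x]`, and Dimension 1 is assumed to end before Dimension 2 begins.

```python
from typing import Optional


def build_name(element: bytes, i: Optional[int], j: Optional[int],
               x: int, y: int) -> bytes:
    """Reorder the bytes of `element` into its Name.

    i, j: 0-based offsets located by Fingerprint 1 and Fingerprint 2
          (None when the fingerprint is not found).
    x, y: number of bytes taken for Dimension 1 and Dimension 2.
    """
    if i is not None and j is not None:
        # Scenario (1): both fingerprints found.
        if i + x > j:
            raise ValueError("sketch assumes Dimension 1 precedes Dimension 2")
        dim1 = element[i:i + x]
        dim2 = element[j:j + y]
        rest = element[i + x:j] + element[j + y:] + element[:i]
        return dim1 + dim2 + rest
    if i is not None:
        # Scenario (2): only Fingerprint 1 found.
        return element[i:i + x] + element[i + x:] + element[:i]
    if j is not None:
        # Scenario (3): only Fingerprint 2 found.
        return element[j:j + y] + element[j + y:] + element[:j]
    # Scenario (4): no fingerprints found; the Name is the element itself.
    return element
```

In every scenario the Name is a permutation of the element's bytes, so it has the same length as the element, and the same function applies to Prime Data Elements and candidate elements alike.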
If the content-associative lookup is successful, it will yield Prime Data Elements that have the same patterns at the locations of the specific dimensions as the candidate element. For example, if both fingerprints are found in the candidate element, the tree navigation process will take it down link 354 of the tree, starting from the root node. If the candidate element has the pattern “99 . . . 3” as “Dimension 1” and the pattern “7 . . . 5” as “Dimension 2”, the tree navigation process will arrive at Node 334. This reaches a subtree containing two Prime Data Elements (PDE 352 and PDE 353), which are likely targets for the derivation. Additional analysis and screening is performed (by first examining the metadata, and if needed, by subsequently fetching and examining the actual Prime Data Elements) to determine which Prime Data Element is best suited for the derivation.
Thus, embodiments described herein identify a variety of tree structures that can be used in the Sieve. Combinations of such structures or variations thereof could be employed to organize the Prime Data Elements. Some embodiments organize the Prime Data Elements in tree form, wherein the entire content of the element is used as the Name of the element. However, the sequence in which bytes appear in the Name of the element is not necessarily the sequence in which the bytes appear in the element. Certain fields of the element are extracted as dimensions and used to form the leading bytes of the Name, and the rest of the bytes of the element make up the rest of the Name. Using these Names, the elements are ordered in the Sieve in tree form. The leading digits of the Name are used to differentiate the higher branches (or links) of the tree, and the rest of the digits are used to progressively differentiate all branches (or links) of the tree. Each node of the tree could have a different number of links emanating from that node. Also, each link from a node could be differentiated and denoted by a different number of bytes, and the description of these bytes could be accomplished through use of regular expressions and other powerful ways to express their specification. All these features lead to a compact tree structure. At the leaf nodes of the tree reside references to individual Prime Data Elements.
In one embodiment, a method of fingerprinting can be applied to the bytes comprising the Prime Data Element. A number of bytes residing at the location identified by the fingerprint can be used to make up a component of the element Name. One or more components could be combined to provide a dimension. Multiple fingerprints could be used to identify multiple dimensions. These dimensions are concatenated and used as the leading bytes of the Name of the element, with the rest of the bytes of the element comprising the rest of the Name of the element. Since the dimensions are located at positions identified by fingerprints, it increases the likelihood that the Name is being formed from consistent content from each element. Elements that have the same value of content at the fields located by the fingerprint will be grouped together along the same leg of the tree. In this fashion, similar elements will be grouped together in the tree data structure. Elements with no fingerprints found in them can be grouped together in a separate subtree, using an alternative formulation of their Names.
In one embodiment, a method of fingerprinting can be applied to the content of the element to determine the locations of the various components (or signatures) of the Skeletal Data Structure (described earlier) within the content of the element. Alternatively, certain fixed offsets inside the content of the element could be chosen to locate a component. Other methods could also be employed to locate a component of the Skeletal Data Structure of the element, including, but not limited to, parsing the element to detect certain declared structure, and locating components within that structure. The various components of the Skeletal Data Structure of the element can be considered as Dimensions, so that a concatenation of these dimensions followed by the rest of the content of each element is used to create the Name of each element. The Name is used to order and organize the Prime Data Elements in the tree.
In another embodiment, the element is parsed in order to detect certain structure in the element. Certain fields in this structure are identified as dimensions. Multiple such dimensions are concatenated and used as the leading bytes of the Name, with the rest of the bytes of the element comprising the rest of the Name of the element. Since the dimensions are located at positions identified by parsing the element and detecting its structure, it increases the likelihood that the Name is being formed from consistent content from each element. Elements that have the same value of content at the fields located by the parsing will be grouped together along the same leg of the tree. In this fashion, once again, similar elements will be grouped together in the tree data structure.
It is worth noting that there is a relationship between the Distance Threshold and the resulting depth of the tree in the prime data sieve. For example, if all input elements were fixed-size elements of 4096 bytes, and if the Distance Threshold was set to 60%, it would mean that Reconstitution Programs of up to 60% of 4 KB (roughly 2458 bytes) would be permissible. In such a case the maximum depth of the prime data sieve in terms of bytes would be roughly 40% of the size of an element, or 1638 bytes. The rest of the bytes could simply be furnished from the bytes of the candidate element and placed in the Reconstitution Program. This would mean that the maximum number of bytes of the Name of an incoming candidate element that would need to be used to traverse the tree data structure to reach a prime data element that would satisfy a derivation would be roughly 1638 bytes. Note that in the actual derivation created, there would be a need to accommodate the bytes required to specify the various offsets where strings from the candidate element would be appended to portions of the prime data element to create the derivation.
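The arithmetic in this example can be checked with a short sketch. The function name `sieve_depth_bound` is hypothetical, the threshold is passed as a whole percentage, and rounding to the nearest byte is an assumption made here so that the result matches the figures quoted above.

```python
def sieve_depth_bound(element_size: int, threshold_percent: int):
    """Return (max Reconstitution Program size, max tree depth) in bytes."""
    # Largest permissible Reconstitution Program, rounded to the nearest byte.
    max_program = round(element_size * threshold_percent / 100)
    # Bytes of the Name that may need to be matched while traversing the tree.
    max_depth = element_size - max_program
    return max_program, max_depth


program_bytes, depth_bytes = sieve_depth_bound(4096, 60)
print(program_bytes, depth_bytes)
```

The two values always sum to the element size, which is the relationship the paragraph describes: a higher Distance Threshold permits a larger Reconstitution Program and therefore a shallower tree.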
In some embodiments, each node in the tree data structure contains a self-describing specification. Tree nodes have one or more children. Each child entry contains information on the differentiating bytes on the link to the child, and a reference to the child node. A child node may be a tree node or leaf node.
In order to increase the efficiency with which fresh Prime Data Elements get installed into the tree, some embodiments incorporate an additional field into the leaf node data structure for each Prime Data Element that is kept at the leaf node of the tree. Note that when a fresh element has to be inserted into the tree, additional bytes of the Name or content of each of the Prime Data Elements in the subtree in question might be needed in order to decide where in the subtree to insert the fresh element, or whether to trigger a further partitioning of the subtree. The need for these additional bytes could require fetching several of the Prime Data Elements in question in order to extract the relevant differentiating bytes for each of these elements with respect to the fresh element. In order to reduce and optimize (and, in most cases, fully eliminate) the number of IOs needed for this task, the data structure in the leaf node includes a certain number of additional bytes from the Name of each Prime Data Element under that leaf node. These additional bytes are referred to as Navigation Lookahead bytes, and assist in sorting the Prime Data Elements with respect to a fresh incoming element. The Navigation Lookahead bytes for a given Prime Data Element are installed into the leaf node structure upon installation of the Prime Data Element into the Sieve. The number of bytes to be retained for this purpose could be chosen statically or dynamically using a variety of criteria, including the depth of the subtree involved and the density of Prime Data Elements in that subtree. For example, for Prime Data Elements being installed at shallow levels of the tree, the solution may add a longer Navigation Lookahead Field than for Prime Data Elements residing in a very deep tree. 
Also, when a fresh element is being installed into the Sieve, and if there are already many Prime Data Elements in the existing target subtree (with increased likelihood of an imminent repartitioning), then additional Navigation Lookahead bytes could be retained for the fresh Prime Data Element when it is being installed into the subtree.
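One way to picture the role of Navigation Lookahead bytes is the sketch below. It is a simplified illustration, not the disclosed leaf node layout: `LeafEntry`, `reference`, and `place_fresh_element` are hypothetical names, entries are assumed to be kept sorted by their lookahead bytes, and `fresh_suffix` stands for the bytes of the fresh element's Name beyond the depth already consumed to reach the leaf.

```python
import bisect
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LeafEntry:
    reference: int    # hypothetical handle to the Prime Data Element in storage
    lookahead: bytes  # Navigation Lookahead bytes retained in the leaf node


def place_fresh_element(entries: List[LeafEntry],
                        fresh_suffix: bytes) -> Optional[int]:
    """Return the sorted position for a fresh element using only leaf data.

    Returns None when the retained lookahead bytes cannot differentiate the
    fresh element from an existing one, in which case that element would
    have to be fetched (an IO) to obtain further differentiating bytes.
    """
    keys = [entry.lookahead for entry in entries]
    for key in keys:
        if fresh_suffix[:len(key)] == key:
            return None  # lookahead exhausted without finding a difference
    return bisect.bisect_left(keys, fresh_suffix)
```

Retaining more lookahead bytes per entry makes the `None` case rarer, which mirrors the trade-off described above between leaf node size and the number of IOs needed during installation.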
In some embodiments, the various branches of the tree are used to map the various data elements into groups or ranges formed by interpreting the differentiating bytes along a link leading to a child subtree as a range delimiter. All elements in that child subtree will be such that the values of the corresponding bytes in the element will be less than or equal to the values for the differentiating bytes specified for the link to the particular child subtree. Thus each subtree will now represent a group of elements whose values fall within a specific range. Within a given subtree, each subsequent level of the tree will progressively divide the set of elements into smaller ranges. This embodiment provides a different interpretation to the components of the self-describing tree node structure shown in
For example, in
The tree navigation process can be modified to accommodate this new range node. Upon arriving at a range node, to choose a given link emanating from that node, the bytes from the Name of the candidate must fall within the range defined for that particular link. If the value of the bytes from the Name of the candidate is larger than the value of the corresponding bytes in all the links, the candidate element falls outside of all ranges spanned by the subtree below—in this case (referred to as an “out of range condition”) a miss condition is detected and the tree navigation process terminates. If the leading bytes of the Name of the candidate element fall within the range determined by the corresponding differentiating bytes along a link leading to the child subtree, tree navigation continues to that subtree below. Unless it terminates due to an “out of range condition”, tree navigation can progressively continue deeper down the tree until it reaches a leaf node data structure.
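Link selection at a range node can be sketched with a binary search. The sketch assumes that the differentiating bytes of the links are kept in ascending order, that each acts as an inclusive upper bound for its child subtree, and that the key is compared byte-wise (lexicographically); the function name `select_range_link` is hypothetical.

```python
import bisect
from typing import List, Optional


def select_range_link(delimiters: List[bytes], key: bytes) -> Optional[int]:
    """Pick the link whose range contains `key`.

    delimiters: ascending differentiating bytes, one per outgoing link;
                link k covers keys greater than delimiters[k - 1] and
                less than or equal to delimiters[k].
    Returns the index of the selected link, or None for the
    "out of range condition" (key larger than every delimiter).
    """
    index = bisect.bisect_left(delimiters, key)
    return index if index < len(delimiters) else None
```

For a node with delimiters `[b"\x3f", b"\x7f", b"\xbf"]`, a key of `b"\x40"` selects the second link, while `b"\xc0"` exceeds every delimiter and is reported as a miss.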
This kind of range node can be employed in the tree structure in conjunction with the trie nodes described in
The foregoing descriptions of methods and apparatuses for representing and using tree nodes and leaf nodes have been presented only for purposes of illustration and description. They are not intended to be exhaustive or to limit the present invention to the forms disclosed. Accordingly, many modifications and variations will be apparent to practitioners skilled in the art.
Upon being presented a candidate element as input, the tree node and leaf node structures described above can be traversed and a content-associative lookup of the tree can be performed based upon the content of the candidate element. The Name of the candidate element will be constructed from the bytes of the candidate element just as the Name of a Prime Data Element was constructed from the content of the Prime Data Element when it was installed in the Sieve. Given an input candidate element, the method for content-associative lookup of the tree involves navigation of the tree structure using the Name of the candidate element, followed by subsequent analysis and screening to decide what to return as the result of the overall content-associative lookup. In other words, the tree navigation process returns a first outcome, and then analysis and screening is performed on that outcome to determine the result of the overall content-associative lookup.
If there are any Prime Data Elements with the same leading bytes of Name as the candidate (or bytes such that they fall within the same range), the tree will identify that subset of Prime Data Elements in the form of a subtree of elements denoted by a link. In general, each tree node or leaf node can store information that enables the tree navigation process to determine which outgoing link, if any, is to be selected to navigate to the next lower level in the tree based upon the corresponding bytes of the Name of the input element, and the identity of the node that is reached when the tree is navigated along the selected link. If each node contains this information, then the tree navigation process can recursively navigate down each level in the tree until no matches are found (at which point the tree navigation process can return a set of Prime Data Elements that exists in the subtree rooted at the current node) or a Prime Data Element is reached (at which point the tree navigation process can return the Prime Data Element and any associated metadata).
Once the tree navigation process has terminated, other criteria and requirements may be used to analyze and screen the outcome of the tree navigation process to determine what should be returned as the result of the overall content-associative lookup. First, one could pick the Prime Data Element with the greatest number of leading bytes from the Name in common with the candidate. Second, when either a single Prime Data Element or multiple Prime Data Elements are returned by the tree navigation process, there could be an additional requirement that they share a certain minimum number of bytes with the Name of the candidate element before qualifying to be returned as the result of the content-associative lookup (otherwise, the content-associative lookup returns a miss). Another example of a screening requirement could be that, if the tree navigation process terminates without reaching a single Prime Data Element so that multiple Prime Data Elements (in the subtree rooted at the node where the tree navigation terminated) are returned as the outcome of the tree navigation process, then these multiple Prime Data Elements will qualify to be returned as the result of the overall content-associative lookup only if the number of these elements is fewer than a certain specified limit such as 4-16 elements (otherwise, the content-associative lookup returns a miss). Combinations of multiple requirements may be employed to determine the result of the content-associative lookup. If multiple candidates still remain, one could examine Navigation Lookahead bytes and also associated metadata to decide which Prime Data Elements are the most suitable. If still unable to narrow the choice down to a single Prime Data Element, one could furnish multiple Prime Data Elements to the Derive function.
In this manner, the lookup process will either report a “miss,” or return a single Prime Data Element, or if not a single Prime Data Element, then a set of Prime Data Elements that are likely to be good starting points for deriving the candidate element.
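The screening step can be sketched as a function applied to the outcome of tree navigation. The names `shared_leading_bytes` and `screen`, the parameters `min_shared` and `limit`, and the use of an empty list to represent a miss are all assumptions made for illustration; the sketch combines the element-count limit, the minimum-shared-bytes requirement, and a ranking by the number of leading Name bytes in common.

```python
from typing import List


def shared_leading_bytes(a: bytes, b: bytes) -> int:
    """Count the leading bytes that two Names have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def screen(candidate_name: bytes, outcome: List[bytes],
           min_shared: int, limit: int) -> List[bytes]:
    """Screen the navigation outcome; an empty list represents a miss."""
    # Multiple elements qualify only if there are fewer than `limit` of them.
    if len(outcome) > 1 and len(outcome) >= limit:
        return []
    # Each element must share a minimum number of leading Name bytes.
    qualified = [name for name in outcome
                 if shared_leading_bytes(name, candidate_name) >= min_shared]
    # Best starting points first: most leading bytes in common.
    return sorted(qualified,
                  key=lambda name: shared_leading_bytes(name, candidate_name),
                  reverse=True)
```

A caller could pass the first element of the result to the Derive function, or furnish the whole list when the choice cannot be narrowed further.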
The tree needs to be designed for efficient content-associative access. A well-balanced tree will provide a comparable depth of access for much of the data. It is expected that the upper few levels of the tree will often be resident in the processor cache, the next few levels in fast memory, and the subsequent levels in flash storage. For very large datasets, it is possible that one or more levels need to reside in flash storage and even disk.
The tree can be laid out so that the 6 levels of the tree can be traversed as follows: 3 levels residing in on-chip cache (containing approximately four thousand “upper level” tree node data structures specifying transitions for links to approximately 256 K nodes), 2 levels in memory (containing 16 million “middle level” tree node data structures specifying transitions for links to 1 billion leaf nodes approximately), and the 6th level in flash storage (accommodating a billion leaf node data structures). The 1 billion leaf node data structures resident at this 6th level of the tree in flash storage furnish the references for the 64 billion Prime Data Elements (on average 64 elements per leaf node).
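The node counts quoted for this layout are consistent with a uniform fan-out of 64 links per node, which is an assumption made here for the purpose of checking the arithmetic (the constant names below are likewise hypothetical).

```python
FANOUT = 64             # assumed number of links per tree node
LEVELS = 6              # levels of the tree, leaf nodes at level 6
ELEMENTS_PER_LEAF = 64  # average stated in the text

# Number of nodes at levels 1..6 (the root is level 1).
nodes_per_level = [FANOUT ** depth for depth in range(LEVELS)]

cache_nodes = sum(nodes_per_level[0:3])   # levels 1-3, on-chip cache
memory_nodes = sum(nodes_per_level[3:5])  # levels 4-5, memory
leaf_nodes = nodes_per_level[5]           # level 6, flash storage
total_elements = leaf_nodes * ELEMENTS_PER_LEAF

print(cache_nodes, nodes_per_level[3], memory_nodes, leaf_nodes, total_elements)
```

Under this assumption the upper three levels hold 4,161 nodes with links to 262,144 (256 K) nodes, the middle two levels hold about 17 million nodes with links to 2^30 (about 1 billion) leaf nodes, and the leaf nodes reference 2^36 (about 64 billion) Prime Data Elements.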
In the example shown in
In the example shown in
Note that the number of bytes needed on any given link (to differentiate the various subtrees below) will be governed by the actual data in the mix of elements that comprise the dataset. Likewise, the number of links out of a given node will also vary with the data. The self-describing tree node and leaf node data structures will declare the actual number and the values of the bytes needed for each link, as well as the number of links emanating from any node.
Further controls can be placed to limit the amount of cache, memory, and storage devoted at the various levels of the tree, to sort the input into as many differentiated groups as possible, within the allocated budget of incremental storage. To handle situations where there are densities and pockets of data that require very deep subtrees to fully differentiate the elements, such densities could be handled efficiently by grouping a larger set of related elements into a flat group at a certain depth (e.g. the 6th level) of the tree and performing a streamlined search and derivation upon these (by first examining the Navigation Lookahead and metadata to determine the best Prime Data Element, or else (as a fallback) looking only for duplicates rather than the full derivation that is afforded by the method for the rest of the data). This would circumvent the creation of very deep trees. Another alternative is to allow deep trees (with many levels) as long as these levels fit in available memory. The moment the deeper levels spill out to flash or disk, steps can be taken to flatten the tree from that level onwards, to minimize the latency that would otherwise be incurred by multiple successive accesses to deeper levels of tree nodes stored in flash or disk.
It is expected that a relatively small fraction of the total bytes from the Name of the element will often be sufficient to identify each Prime Data Element. Studies performed on a variety of real world datasets using the embodiments described herein confirm that a small subset of the bytes of a Prime Data Element serves to order the majority of the elements to enable the solution. Thus, such a solution is efficient in terms of the amount of storage that it requires for its operation.
In terms of accesses needed for the example from
A candidate element that is a duplicate of a Prime Data Element will need 1 IO to fetch the Prime Data Element in order to verify the duplicate. Once the duplicate is verified, there will be one more IO to update the metadata in the tree. Hence, ingestion of duplicate elements will need two IOs after the tree lookup, for a total of 3 IOs.
A candidate element that fails the tree lookup and is neither a duplicate nor a derivative requires 1 more IO to store the element as a new Prime Data Element in the Sieve, and another IO to update the metadata in the tree. Thus, ingestion of a candidate element that fails the tree lookup will require 2 IOs after the tree lookup, leading to a total of 3 IOs. However, for candidate elements where the tree lookup process terminates without needing a storage IO, a total of only 2 IOs is needed for ingesting such candidate elements.
A candidate element that is a derivative (but not a duplicate) will first need 1 IO to fetch the Prime Data Element needed to compute the derivation. Since it is expected that most often derivations will be off a single Prime Data Element (rather than multiple), only a single IO will be needed to fetch the Prime Data Element. Subsequent to successful completion of the derivation, 1 more IO will be needed to store the Reconstitution Program and the derivation details in the entry created for the element in storage, and another IO to update the metadata in the tree (such as counts, etc.) to reflect the new derivative. Hence, ingestion of a candidate element that becomes a derivative requires 3 additional IOs after the first tree lookup for a total of 4 IOs.
In summary, to ingest a candidate element and apply the Data Distillation™ method to it (while exploiting redundancy globally across a very large dataset) requires approximately 3 to 4 IOs. Compared to what is needed by traditional data deduplication techniques, this is typically just one more IO per candidate element, in return for which redundancy can be exploited globally across the dataset at a grain that is finer than the element itself.
A storage system that offers 250,000 random IO accesses/sec (which means bandwidth of 1 GB/sec of random accesses to pages of 4 KB) could ingest and perform the Data Distillation™ method on about 62,500 input chunks per second (250,000 divided by 4 IOs per input chunk of average size 4 KB each). This enables an ingest rate of 250 MB/sec while using up all the bandwidth of the storage system. If only half of the bandwidth of the storage system is used (so that the other half is available for accesses to the stored data), such a Data Distillation™ system could still deliver ingest rates of 125 MB/sec. Thus, given sufficient processing power, Data Distillation™ systems are able to exploit redundancy globally across the dataset (at a grain that is finer than the element itself) with an economy of IOs and deliver data reduction at ingest rates in the hundreds of megabytes per second on contemporary storage systems.
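The IO accounting and the resulting ingest rate can be reproduced with a short calculation. The names below are hypothetical, the single IO for the tree lookup is inferred from the totals stated above, and 1 MB is taken as 1000 KB to match the round figures in the text.

```python
TREE_LOOKUP_IOS = 1  # inferred: each stated total is this plus the IOs after lookup

# IOs needed after the tree lookup, per kind of candidate element.
IOS_AFTER_LOOKUP = {"duplicate": 2, "new_prime": 2, "derivative": 3}


def total_ios(kind: str) -> int:
    """Total IOs to ingest one candidate element of the given kind."""
    return TREE_LOOKUP_IOS + IOS_AFTER_LOOKUP[kind]


def ingest_rate_mb_per_sec(iops: int, ios_per_element: int, element_kb: int,
                           bandwidth_fraction: float = 1.0) -> float:
    """Ingest rate when a fraction of the storage IOs is spent on ingest."""
    elements_per_sec = iops * bandwidth_fraction / ios_per_element
    return elements_per_sec * element_kb / 1000


print(ingest_rate_mb_per_sec(250_000, 4, 4))       # all of the bandwidth
print(ingest_rate_mb_per_sec(250_000, 4, 4, 0.5))  # half of the bandwidth
```

Using the worst case of 4 IOs per element gives a conservative estimate; a workload dominated by duplicates or fresh Prime Data Elements (3 IOs each) would ingest somewhat faster on the same storage system.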
Thus, as confirmed by the test results, embodiments described herein achieve the complex task of searching for elements (from which an input element can be derived with minimal storage needed to specify the derivation) from a massive store of data with an economy of IO accesses and with minimal incremental storage needed for the apparatus. This framework thus constructed makes it feasible to find elements suitable for derivation using a smaller percentage of the total bytes of the element, leaving the bulk of the bytes available for perturbation and derivation. An important insight that explains why this scheme works effectively for much data is that the tree provides a wieldy, fine-grained structure that allows one to locate the differentiating and distinguishing bytes that identify elements in the Sieve, and although these bytes are each at different depths and positions in the data, they can be isolated and stored efficiently in the tree structure.
Once the difficult problem of finding suitable Prime Data Elements (from which to attempt to derive the candidate element) has been solved, the problem is narrowed down to examining one or a small subset of Prime Data Elements and optimally deriving the candidate element from them with minimum storage needed to specify the derivation. Other objectives include keeping the number of accesses to the storage system to a minimum, and keeping the derivation time and the reconstitution time acceptable.
The Deriver must express the candidate element as the result of transformations performed on the one or more Prime Data Elements, and must specify these transformations as a Reconstitution Program which will be used to regenerate the derivative upon data retrieval. Each derivation may require its own unique program to be constructed. The function of the Deriver is to identify these transformations and create the Reconstitution Program with the smallest footprint. A variety of transformations could be employed, including arithmetic, algebraic, or logical operations performed upon the one or more Prime Data Elements or upon specific fields of each Element. Additionally, one could use byte manipulation transformations, such as the concatenation, insertion, replacement, and deletion of bytes in the one or more Prime Data Elements.
The Deriver could avail itself of the processing power of the underlying machine and work within the processing budget allocated to it to provide the best analysis possible within the cost-performance constraints of the system. Given that microprocessor cores are more readily available, and given that IO accesses to storage are expensive, the Data Distillation™ solution has been designed to take advantage of the processing power of contemporary microprocessors to efficiently perform local analysis and derivation of the content of the candidate element off a few Prime Data Elements. It is expected that the performance of the Data Distillation™ solution (on very large data) will be rate-limited not by the computational processing but by the IO bandwidth of a typical storage system. For example, it is expected that a couple of microprocessor cores will suffice to perform the required computation and analysis to support ingest rates of several hundred megabytes per second on a typical flash-based storage system supporting 250,000 IOs/sec. Note that two such microprocessor cores from a contemporary microprocessor such as the Intel Xeon Processor E5-2687W (10 cores, 3.1 GHz, 25 MB cache) constitute only a fraction (two of ten) of the total computational power available from the processor.
The size of the information needed to specify the Derivative element (that is derived from the one or more Prime Data Elements) is the sum of the size of the Reconstitution Program and the size of the references needed to specify the required (one or more) Prime Data Elements. The size of the information needed to specify a candidate element as a Derivative element is referred to as the Distance of the candidate from the Prime Data Element. When the candidate can be feasibly derived from any one of multiple sets of Prime Data Elements, the set of Prime Data Elements with the shortest Distance is chosen as the target.
When the candidate element needs to be derived from more than one Prime Data Element (by assembling extracts derived from each of these), the Deriver needs to factor in the cost of the additional accesses to the storage system and weigh that against the benefit of a smaller Reconstitution Program and a smaller Distance. Once an optimal Reconstitution Program has been created for a candidate, its Distance is compared with the Distance Threshold; if it does not exceed the threshold, the derivation is accepted. Once a derivation is accepted, the candidate element is reformulated as a Derivative Element and replaced by the combination of the Prime Data Element and the Reconstitution Program. The entry in the distilled data created for the candidate element is replaced by the Reconstitution Program plus the one or more references to the relevant Prime Data Elements. If the Distance for the best derivation exceeds the Distance Threshold, the derivative will not be accepted.
In order to yield data reduction, the Distance Threshold must always be less than the size of the candidate element. For example, the Distance Threshold may be set to 50% of the size of the candidate element, so that a derivative will only be accepted if its footprint is less than or equal to half the footprint of the candidate element, thereby ensuring a reduction of 2× or greater for each candidate element for which a suitable derivation exists. The Distance Threshold can be a predetermined percentage or fraction, either based on user-specified input or chosen by the system. The Distance Threshold may be determined by the system based on static or dynamic parameters of the system.
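The interplay of Reconstitution Program, Distance, and Distance Threshold can be illustrated with a small sketch. It is not the disclosed Deriver: it substitutes Python's standard `difflib.SequenceMatcher` to find matching byte runs, derives from a single Prime Data Element, restricts the program to two hypothetical operations (`copy` and `literal`), and uses assumed encoding costs (5 bytes per copy, 3 bytes plus payload per literal, 8 bytes for the reference to the Prime Data Element).

```python
import difflib
from typing import List


def derive(prime: bytes, candidate: bytes) -> List[tuple]:
    """Build a Reconstitution Program that turns `prime` into `candidate`."""
    program = []
    matcher = difflib.SequenceMatcher(None, prime, candidate, autojunk=False)
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag == "equal":
            program.append(("copy", a0, a1 - a0))  # reuse bytes of the prime
        elif tag in ("replace", "insert"):
            program.append(("literal", candidate[b0:b1]))
        # "delete": those bytes of the prime are simply never copied
    return program


def reconstitute(prime: bytes, program: List[tuple]) -> bytes:
    """Regenerate the derivative from the prime and its program."""
    out = bytearray()
    for op in program:
        if op[0] == "copy":
            _, offset, length = op
            out += prime[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


def distance(program: List[tuple], reference_size: int = 8) -> int:
    """Size of the program plus the reference (assumed encoding costs)."""
    size = reference_size
    for op in program:
        size += 5 if op[0] == "copy" else 3 + len(op[1])
    return size


def accept(program: List[tuple], candidate_size: int,
           threshold: float = 0.5) -> bool:
    """Accept the derivation only if its Distance is within the threshold."""
    return distance(program) <= threshold * candidate_size
```

A candidate that differs from a Prime Data Element by a few inserted bytes yields a program of a handful of operations and is accepted under a 50% threshold, whereas a candidate whose program would approach the candidate's own size is rejected by `accept`.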
It is evident that to reconstitute an Element from the losslessly reduced form in the Distilled Data to its original unreduced form, only the Prime Data Element(s) and Reconstitution Program specified for the Element need to be fetched. Thus, to reconstitute a given Element, no other Elements need to be accessed or reconstituted. This makes the Data Distillation™ apparatus efficient even when servicing a random sequence of requests for reconstitution and retrieval. Note that traditional methods of compression such as the Lempel-Ziv method need to fetch and decompress the entire window of data containing a desired block. For example, if a storage system employs the Lempel-Ziv method to compress 4 KB blocks of data using a window of 32 KB, then to fetch and decompress a given 4 KB block, the entire window of 32 KB needs to be fetched and decompressed. This imposes a performance penalty because more bandwidth is consumed and more data needs to be decompressed in order to deliver the desired data. The Data Distillation™ apparatus does not incur such a penalty.
The Data Distillation™ apparatus can be integrated into computer systems in a variety of ways to organize and store data in a manner that efficiently uncovers and exploits redundancy globally across the entire data in the system.
A file system (or filesystem) associates a file (e.g., a text document, a spreadsheet, an executable, a multimedia file, etc.) with an identifier (e.g., a filename, a file handle, etc.), and enables operations (e.g., read, write, insert, append, delete, etc.) to be performed on the file by using the identifier associated with the file. The namespace implemented by a file system can be flat or hierarchical. Additionally, the namespace can be layered, e.g., a top-layer identifier may be resolved into one or more identifiers at successively lower layers until the top-layer identifier is completely resolved. In this manner, a file system provides an abstraction of the physical data storage device(s) and/or storage media (e.g., computer memories, flash drives, disk drives, network storage devices, CD-ROMs, DVDs, etc.) that physically store the contents of the file.
The physical storage devices and/or storage media that are used for storing information in a file system may use one or multiple storage technologies, and can be located at the same network location or can be distributed across different network locations. Given an identifier associated with a file and one or more operation(s) that are requested to be performed on the file, a file system can (1) identify one or more physical storage devices and/or storage media, and (2) cause the physical storage devices and/or storage media that were identified by the file system to effectuate the operation that was requested to be performed on the file associated with the identifier.
Whenever a read or a write operation is performed in the system, different software and/or hardware components may be involved. The term “Reader” can refer to a collection of software and/or hardware components in a system that are involved when a given read operation is performed in the system, and the term “Writer” can refer to a collection of software and/or hardware components in a system that are involved when a given write operation is performed in the system. Some embodiments of the methods and apparatuses for data reduction described herein can be utilized by or incorporated into one or more software and/or hardware components of a system that are involved when a given read or write operation is performed. Different Readers and Writers may utilize or incorporate different data reduction implementations. However, each Writer that utilizes or incorporates a particular data reduction implementation will correspond to a Reader that also utilizes or incorporates the same data reduction implementation. Note that some read and write operations that are performed in the system may not utilize or incorporate the data reduction apparatus. For example, when Data Distillation™ Apparatus or Data Reduction Apparatus 103 retrieves Prime Data Elements or adds new Prime Data Elements to the Prime Data Store, it can perform the read and write operations directly without data reduction.
Specifically, in
Implementation examples for
We now discuss the use of the Data Distillation™ apparatus in Wide Area Network installations where workgroups collaboratively share data that is spread across multiple nodes. When data is first created, it can be reduced and communicated as illustrated in
In such an installation, any modifications to the data at any given site need to be communicated to all other sites, so that the Prime Data Sieve at each site is kept consistent. Hence, as shown in
Thus, the Data Distillation™ apparatus can be used to reduce the footprint of data stored across the various sites of a Wide Area Network as well as make efficient use of the communication links of the network.
Mapper 1207 itself has two components within it, namely, the set of tree node data structures and the set of leaf node data structures that define the overall tree. The set of tree node data structures could be placed into one or more files. Likewise the set of leaf node data structures could be placed into one or more files. In some embodiments, a single file called the Tree Nodes File holds the entire set of tree node data structures for the tree created for the Prime Data Elements for the given dataset (Input Files 1201), and another single file called the Leaf Nodes File holds the entire set of leaf node data structures for the tree created for the Prime Data Elements for that dataset.
In
The tree nodes in the Tree Nodes File will contain references to other tree nodes within the Tree Nodes File. The deepest layer (or lowermost levels) of tree nodes in the Tree Nodes File will contain references to entries in leaf node data structures in the Leaf Nodes File. Entries in the leaf node data structures in the Leaf Nodes File will contain references to Prime Data Elements in the PDE File.
The Tree Nodes File, Leaf Nodes File and PDE File are illustrated in
Note that
The Data Distillation™ apparatus shown in
The Data Distillation™ apparatus can be configured to exploit redundancy across a subset of the Input Dataset and deliver lossless reduction for each subset of data presented to it. For example, as shown in
While the Data Distillation™ apparatus is by design already efficient at exploiting redundancy across the global scope of data, the above technique may be used to further speed up the data reduction process and further improve its efficiency. The throughput of the data reduction process can be increased by limiting the size of a Data Lot so that it fits into the available memory of a system. For example, an Input Dataset which is many terabytes or even petabytes in size could be broken up into numerous Data Lots, each of size, say, 256 GB, and each Data Lot can be speedily reduced. Using a single processor core (Intel Xeon E5-1650 V3, Haswell 3.5 GHz processor) with 256 GB of memory, such a solution exploiting redundancy across a scope of 256 GB has been implemented in our labs to yield ingest rates of several hundred megabytes per second of data while delivering reduction levels of 2-3× on various datasets. Note that a scope of 256 GB is roughly eight million times larger than 32 KB, which is the size of the window at which the Lempel-Ziv method delivers ingest performance of between 10 MB/sec and 200 MB/sec on modern processors. Thus, by limiting the scope of redundancy appropriately, improvements in the speed of the data distillation process can be achieved at the potential cost of some reduction.
Further refinements exist that can improve the efficiency of the apparatus and speed up the reconstitution process. In some embodiments, a single unified Mapper is used for the distillation, but instead of holding the prime data elements in a single PDE File, the prime data elements are held across n PDE Files. Thus, the erstwhile single PDE File is partitioned into n PDE Files, each smaller than a certain threshold size, and each partition being created during the distillation process when the PDE File exceeds that threshold size (upon growth due to installations of prime data elements). Distillation proceeds for each input file by consulting the Mapper to content-associatively select the appropriate prime data element suitable for derivation, and subsequently deriving off the appropriate prime data element that is fetched from the appropriate PDE File where it resides. Each distilled file is further enhanced to list out all those PDE Files (out of the n PDE Files) which contain prime data elements to which the particular distilled file makes references. In order to reconstitute the particular distilled file, only those listed PDE Files will need to be loaded up or opened to be accessed for the reconstitution. This has the advantage that for reconstitution of a single distilled file or a few distilled files, only those PDE Files containing the prime data elements needed for that particular distilled file need to be accessed or kept active, while the other PDE Files need not be retained or loaded into fast tiers of memory or storage. Thus, reconstitution can be sped up and made more efficient.
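The per-file list of PDE Files can be illustrated with a short sketch. It assumes each reference in a distilled file records which of the n PDE Files holds the referenced prime data element; the `PdeRef` record and its fields are hypothetical stand-ins, not a format defined in this disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PdeRef:
    pde_file: int  # index (0..n-1) of the PDE File holding the prime data element
    offset: int    # hypothetical byte offset of the element within that PDE File


def pde_files_needed(refs):
    """Return the sorted indices of the PDE Files a distilled file refers to."""
    return sorted({ref.pde_file for ref in refs})


# A distilled file whose elements refer only to PDE Files 0 and 3.
distilled_file_refs = [PdeRef(0, 128), PdeRef(3, 64), PdeRef(0, 4096), PdeRef(3, 900)]
print(pde_files_needed(distilled_file_refs))  # [0, 3]
```

Only the PDE Files in the returned list would need to be opened for reconstitution of that distilled file; the others can stay in slower tiers.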
The partitioning of the PDE File into n PDE Files can be further guided by criteria that localize the reference patterns made to the prime data during the reduction of any given file in the dataset. The apparatus can be enhanced with counters that count and estimate the density of references to elements in the current PDE File. If this density is high, the PDE File will not be partitioned or split, and will keep growing upon subsequent installation of elements. Once the density of references from a given distilled file tapers down, the PDE File can be allowed to be split and partitioned upon subsequent growth beyond a certain threshold. Once partitioned, a fresh PDE File will be opened, and subsequent installations from subsequent distillation will be made into this fresh PDE File. This arrangement will further speed up reconstitution in the event that only a subset of files from a Data Lot need to be reconstituted.
In some embodiments, the Data Distillation apparatus may be enhanced to further improve the memory efficiency and performance efficiency of reconstitution and retrieval. During the distillation process, metadata is collected for each prime data element and stored as part of the reduced data footprint. This metadata identifies those prime data elements from which either duplicates or derivatives are created during the distillation process. During reconstitution, it is only these prime data elements that need be retained in memory and only until such time that all the elements which are either duplicates of these prime data elements or derivatives that derive off these prime data elements have been reconstituted. The metadata further provides, for each prime data element, a sum of the number of duplicates and derivatives that derive off this prime data element. This sum will be referred to as the Re-use Count, which indicates the number of times this prime data element will be used for the purpose of reconstituting either a duplicate or a derivative. At the start of reconstitution of the dataset, this metadata is first fetched into memory. As reconstitution progresses, prime data elements are fetched from storage and used to reconstitute elements and deliver them into the reconstituted dataset. As they are fetched in, those prime data elements that have a non-zero Re-use Count (as identified by the metadata) are allocated into memory so that they may be retained and made readily available (without incurring storage IOs) for use for reconstituting elements which are either duplicates or derivatives of them. As reconstitution progresses, the Re-use Count for a given prime data element is decremented upon reconstitution of each element that is either a duplicate or a derivative of the given prime data element. 
Upon reconstitution of an element which is either a duplicate or derivative of one or more prime data elements, the Re-use Count of each of these prime data elements will be decremented. Once the Re-use Count for a given prime data element goes to zero, the prime data element need no longer be retained in memory, thus reducing the memory required to hold prime data elements during reconstitution. The memory may be organized as a cache, and a prime data element whose Re-use Count goes to zero may be de-allocated from the cache. The Re-use Count of prime data elements provides information about the lifetime during which they may be re-used for reconstitution of duplicates or derivatives, and hence retained in memory for that duration. The metadata retained at the end of the Distillation process for the purposes of describing the lifetime of prime data elements used for duplicates or derivation will be referred to as PDE Re-use and Lifetime Metadata.
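The retention policy driven by the Re-use Count can be sketched as below. This is a minimal in-memory illustration, assuming the PDE Re-use and Lifetime Metadata is available as a mapping from a prime data element identifier to its Re-use Count; the class `ReuseCountCache` and its method names are hypothetical.

```python
class ReuseCountCache:
    """Retains prime data elements only while their Re-use Count is non-zero."""

    def __init__(self, reuse_counts):
        # pde_id -> remaining number of duplicates/derivatives still to reconstitute
        self._remaining = dict(reuse_counts)
        # pde_id -> bytes of the prime data elements currently held in memory
        self._resident = {}

    def install(self, pde_id, data):
        """Called when a prime data element is fetched from storage."""
        if self._remaining.get(pde_id, 0) > 0:
            self._resident[pde_id] = data

    def use(self, pde_id):
        """Return the element for reconstituting one duplicate or derivative."""
        data = self._resident[pde_id]
        self._remaining[pde_id] -= 1
        if self._remaining[pde_id] == 0:
            # Lifetime is over: de-allocate the element from memory.
            del self._resident[pde_id]
        return data

    def resident_ids(self):
        return sorted(self._resident)


cache = ReuseCountCache({"A": 2, "B": 0})
cache.install("A", b"aaaa")
cache.install("B", b"bb")    # Re-use Count of zero: not retained
print(cache.resident_ids())  # ['A']
cache.use("A")
cache.use("A")               # second and last use releases "A"
print(cache.resident_ids())  # []
```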
The PDE Re-use and Lifetime Metadata can be further enhanced to include the size of each of the prime data elements from which either duplicates or derivatives are created during the distillation process. The sum of the sizes of all these prime data elements provides an upper bound to the amount of memory that is needed to hold prime data elements during reconstitution, and will be referred to as the Coarse Grain Working Set of prime data elements for the given dataset.
Note that after the distillation process, it is possible that there are prime data elements in the reduced data that have not been used for creating duplicates or derivatives. If the Re-use Count was generated for such prime data elements, it would be zero. During reconstitution, after such prime data elements have been fetched from storage, and transmitted to the reconstituted output, such elements do not need to be retained in memory because they are never referenced again for the purposes of reconstituting duplicates or derivatives. In such a situation, the Coarse Grain Working Set of prime data elements for the dataset will be smaller than the sum of the sizes of all prime data elements in the dataset. Thus, during reconstitution, only retaining the Coarse Grain Working Set in memory leads to a savings in memory compared to retaining all prime data elements in memory.
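The Coarse Grain Working Set can be computed directly from the metadata. The sketch below assumes the metadata is available as (size in bytes, Re-use Count) pairs, one per prime data element; that representation and the function name are assumptions made for illustration.

```python
def coarse_grain_working_set(metadata):
    """Sum the sizes of prime data elements that are re-used at least once.

    metadata: iterable of (size_bytes, reuse_count) pairs, one per element.
    """
    return sum(size for size, reuse_count in metadata if reuse_count > 0)


# Three prime data elements; the second is never used for a duplicate or derivative.
metadata = [(4096, 3), (2048, 0), (1024, 1)]
print(coarse_grain_working_set(metadata))  # 5120
print(sum(size for size, _ in metadata))   # 7168, the size of all prime data elements
```

In this example, retaining only the Coarse Grain Working Set needs 5120 bytes rather than the 7168 bytes needed to retain every prime data element.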
In practice, the actual amount of memory required to hold the prime data elements that are needed for reconstitution of derivatives during reconstitution could be even smaller than the Coarse Grain Working Set. This is because it is possible that some prime data elements may dynamically get de-allocated from the memory (on account of their Re-use Count being decremented to zero) before other prime data elements are referenced for the first time during reconstitution and therefore arranged to be fetched and allocated into memory (and then used for subsequent reconstitution of derivatives). Since both these categories of prime data elements need not reside in the memory at the same time, the actual amount of memory needed during reconstitution will be smaller than the Coarse Grain Working Set.
The above arrangement enables more efficient use of the memory needed to hold prime data elements during reconstitution of a dataset, while still eliminating random accesses to storage during reconstitution. Once again, the load of the PDE File from storage can be a set of accesses for a sequentially contiguous chunk of bytes, the read of each Distilled File can also be a set of accesses for a sequentially contiguous chunk of bytes, and lastly each reconstituted input file can be written out to storage as a set of accesses for a sequentially contiguous chunk of bytes. The storage performance of this arrangement more closely tracks the performance of reading and writing sequentially contiguous chunks of bytes rather than the performance of a solution which incurs multiple random storage accesses.
The above enhancement can be made to any of the Distillation apparatuses illustrated in
In some embodiments, rather than allocating memory for reconstitution based on the Coarse Grain Working Set size of a dataset, the Data Distillation apparatus may be further enhanced to help determine a tighter bound on the amount of memory needed to hold prime data elements during reconstitution. This bound will be referred to as the Fine Grain Working Set and is determined by examining the creation and usage of all the prime data elements and their spatial and temporal characteristics across the entire distillation task. Additional book-keeping and more fine-grain analysis may be undertaken during distillation to compute this tighter bound. This is described below.
Consider an input dataset (of a given size and spanning a certain range of addresses) which is factorized into N elements. Distillation of this dataset entails N distillation events—each event converts an input element into either a prime data element or a derivative (or duplicate) of a previously created prime data element. Conversely, reconstitution of the entire dataset will entail N reconstitution events, each event either furnishing a prime data element as the output data element, or executing a reconstitution program on a prime data element to deliver the output element. The N events comprise a sequence, and the lifetime of prime data elements may be described in terms of these events in the sequence both during distillation and reconstitution. Each event may be associated with the address of the element in the original input data stream that is being distilled (and this is also the address of the same element in the output stream when it is regenerated by the reconstitution). This address is referred to as the locator address for the event. The locator address for each event may be retained through the distillation process and made available at the end of the distillation process for an inventory step to help determine the Fine Grain Working Set.
During the distillation process, the distillation apparatus is further enhanced to record for each prime data element the event where the prime data element is first created and also the event where it is last used for the creation of duplicates or derivatives. These two events effectively define the lifetime of this prime data element. During reconstitution, this prime data element need be retained in memory only for this lifetime. During reconstitution, after the end of the lifetime of any prime data element, viz. after the event of last use, that prime data element is no longer needed for any remaining reconstitution and may be de-allocated from the memory kept aside for reconstitution. For each prime data element, the “first created” and “last used” fields will store or record the locator address of the event where the prime data element was first created or last used respectively.
At each step of the distillation process (from the 1st distillation event through the last or Nth distillation event) the record of “first created” and/or “last used” fields for the one or more prime data elements involved in the event is updated. By the end of the distillation process, the record of the “first created” and “last used” entries is complete for all prime data elements. After distillation completes, a fresh inventory analysis is performed by stepping through the locator addresses of each distillation event and determining the memory needed to hold prime data elements subsequent to each event. At any given event, the Dynamic Set of prime data elements that need to be retained in memory for subsequent distillation is calculated as the difference between all the “Prime data elements created thus far” and all the “Prime data elements retired thus far”. The “Prime data elements created thus far” is computed as the sum of the sizes of all prime data elements whose “first created” value is less than or equal to the address locator value for this event. The “Prime data elements retired thus far” is computed as the sum of the sizes of all prime data elements whose “last used” value is less than (precedes) the address locator value for this event. The Dynamic Set is calculated for each distillation event. The largest value of the Dynamic Set across all the events is defined to be the Fine Grain Working Set. This is the upper bound on the size of memory that needs to be allocated during reconstitution to ensure that all the prime data elements that are needed during reconstitution are dynamically available in memory without the need for incurring any random IOs to fetch them.
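The inventory analysis can be sketched as follows, mirroring the "created thus far" and "retired thus far" sums term by term. The record `PdeLifetime` and the function name are hypothetical. The sketch also assumes that a prime data element which is never re-used has its "last used" field equal to its "first created" field, which the text does not state.

```python
from dataclasses import dataclass


@dataclass
class PdeLifetime:
    size: int           # size of the prime data element in bytes
    first_created: int  # locator address of the event that created it
    last_used: int      # locator address of the event that last used it


def fine_grain_working_set(event_locators, pdes):
    """Return the largest Dynamic Set over all events."""
    peak = 0
    for locator in event_locators:
        created = sum(p.size for p in pdes if p.first_created <= locator)
        retired = sum(p.size for p in pdes if p.last_used < locator)
        peak = max(peak, created - retired)  # Dynamic Set at this event
    return peak


events = [0, 100, 200, 300, 400]
pdes = [
    PdeLifetime(size=300, first_created=0, last_used=100),
    PdeLifetime(size=100, first_created=200, last_used=200),  # never re-used
    PdeLifetime(size=200, first_created=300, last_used=400),
]
print(fine_grain_working_set(events, pdes))  # 300
```

Here the Fine Grain Working Set is 300 bytes, while the sum of the sizes of the two re-used elements is 500 bytes, because the first element is retired before the third is created. The nested sums are written for clarity rather than speed; a single sorted sweep over the lifetimes would avoid rescanning every element at every event.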
In some embodiments, rather than performing the Dynamic Set calculation at each event (and therefore at each locator address), the calculation may be performed for a number of consecutive events or a range of locator addresses all at once. This will reduce the amount of calculation and iteration needed during the inventory step. Although this will create a more conservative estimate of the Fine Grain Working Set, it will have a smaller impact on the performance. In practical real-world situations this estimate is found to approach the actual Fine Grain Working Set, and hence one can harvest the benefit of using the smaller amount of memory (that is demanded by the actual dynamic working set requirements of the dataset) while incurring negligible performance loss on account of having to compute the Fine Grain Working Set estimate at the end of the Distillation process.
It can be seen that for the example illustrated in
The Fine Grain Working Set that is computed at the end of the inventory step following Distillation of a dataset may be recorded and stored in the Reduced Data. Prior to invoking Reconstitution, the Reduced Data may be queried by the user to procure the value of the Fine Grain Working Set for this Reduced Data. The user may then invoke Reconstitution on the Reduced Data, and furnish the task with memory (specified via a Reconstitution Memory Budget control) equal to or greater than the Fine Grain Working Set.
The apparatus described so far enables a more efficient use of memory during reconstitution by de-allocating prime data elements from memory when their Re-use Count is exhausted. By computing the Fine Grain Working Set during distillation, a tighter bound can be procured for the memory needed during reconstitution, and after querying the Reduced Data to discover this value, the Reconstitution Memory Budget can be set accordingly. Further improvements can be made to enable the user to specify a priori (at the start of the distillation) a “Projected Reconstitution Memory Budget” to act as a bound to constrain the distillation process, such that during reconstitution the Fine Grain Working Set never exceeds this Projected Reconstitution Memory Budget. Given such a constraint, rather than execute the inventory step only at the end of distillation of the dataset to simply discover the Fine Grain Working Set, the distillation process is enhanced to perform the inventory step repeatedly at intervals during the distillation process, and to compute a conservative estimate of the Fine Grain Working Set up to the point of the interval. The moment this estimate of the Fine Grain Working Set approaches the Projected Reconstitution Memory Budget, rather than continue with distillation for the next interval, the Data Lot is closed and a fresh scope opened up for subsequent distillation. In this scenario, the estimate for the Fine Grain Working Set is conservative and could lead to a larger value than the actual Fine Grain Working Set. This is because the entire distillation process has not completed at the point in time when each estimation is invoked. In this approach, given a Projected Reconstitution Memory Budget, distillation could first be allowed to proceed until such point that the sum of the sizes of all the prime data elements (the size of the PDE Store) reaches the Projected Reconstitution Memory Budget. The inventory and estimation step can now be invoked for the first time.
The estimate could count into the Estimated Fine Grain Working Set the sum of the sizes of all the prime data elements which have a non-zero Re-use Count; at this point it is not known which elements have been used for the last time, and hence one cannot subtract the “last used” elements. Elements with a zero Re-use Count could be removed from the estimate. The difference between the Projected Reconstitution Memory Budget and the Estimated Fine Grain Working Set is noted, and distillation can be allowed to proceed to ingest as many bytes of input as this difference, after which the estimation is invoked again. The reason for this reduced interval is that each of the incoming elements could lead to an increase of the Re-use Count of existing elements with a Re-use Count of 0 and could bring them into the Fine Grain Working Set estimate. Once again the estimation can be invoked and the same process repeated. Once the interval of forward progress becomes smaller than a certain threshold, the Data Lot (the scope across which redundancy is exploited) can be closed, and further distillation of the dataset will proceed by opening a fresh Data Lot.
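One estimation step of this procedure can be sketched as below. The function name `next_ingest_interval`, the (size, Re-use Count) representation of the metadata, and the `min_interval` parameter (the "certain threshold" on forward progress) are hypothetical choices made for illustration.

```python
def next_ingest_interval(projected_budget, pde_metadata, min_interval):
    """Decide how many more input bytes to distill before the next estimate.

    pde_metadata: iterable of (size_bytes, reuse_count) for the elements
    installed so far. Returns None when the Data Lot should be closed.
    """
    # Conservative estimate: every element re-used so far is assumed live.
    estimated_working_set = sum(size for size, reuse in pde_metadata if reuse > 0)
    headroom = projected_budget - estimated_working_set
    if headroom < min_interval:
        return None  # forward progress too small: close the Data Lot
    return headroom


print(next_ingest_interval(1000, [(400, 2), (300, 0), (200, 1)], min_interval=100))  # 400
print(next_ingest_interval(1000, [(400, 2), (300, 1), (250, 1)], min_interval=100))  # None
```

In the first call the estimate is 600 bytes, so distillation may ingest another 400 bytes before re-estimating. In the second call only 50 bytes of headroom remain, which is below the threshold, so the Data Lot is closed.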
In this manner, distillation may be constrained such that the actual memory needed to hold prime data elements for reconstitution never exceeds a pre-decided memory budget. Before reconstitution, one will no longer need to query the Reduced Data to procure the actual Fine Grain Working Set, but rather can simply set the Reconstitution Memory Budget to the value of the Projected Reconstitution Memory Budget that was set at Distillation. This method might be employed in environments where there is less control on how reconstitution is performed, and where it might be uncertain whether users will query the Reduced Data to determine the value of the Fine Grain Working Set. This method of specifying a Projected Reconstitution Memory Budget does indeed lead to a conservative estimate of the Fine Grain Working Set, and might often lead to a premature closure of the Data Lot when the budget is about to be exceeded. An alternative approach is to not employ the Projected Reconstitution Memory Budget at distillation, and simply specify the Reconstitution Memory Budget at Reconstitution. When the amount of memory used by reconstitution reaches the budget, the implementation can switch to treating the memory like a cache: to make room for incoming prime data elements, the implementation can make a decision on which existing prime data elements to de-allocate, thereby incurring occasional random IOs when these are brought back from storage when needed. The incremental loss in performance due to these IOs could be traded off against the benefit of being able to perform distillation across a larger scope.
Note that the preceding descriptions use memory as the fastest storage tier available to reconstitution and describe how to speed up reconstitution by retaining the prime data elements in memory. Note that the features above may be utilized to hold the prime data elements in the fastest available tier of storage, even if that tier is not DRAM.
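The alternative of treating the memory as a cache can be sketched as follows. The text leaves the choice of which element to de-allocate open; this sketch assumes a least-recently-used policy purely for illustration. The class `BudgetedPdeCache` and the `fetch_from_storage` callable are hypothetical.

```python
from collections import OrderedDict


class BudgetedPdeCache:
    """Holds prime data elements within a Reconstitution Memory Budget (LRU assumed)."""

    def __init__(self, budget_bytes, fetch_from_storage):
        self.budget = budget_bytes
        self.fetch = fetch_from_storage  # callable: pde_id -> bytes
        self.used = 0
        self.entries = OrderedDict()     # pde_id -> bytes, oldest first
        self.storage_fetches = 0         # counts IOs to storage

    def get(self, pde_id):
        if pde_id in self.entries:
            self.entries.move_to_end(pde_id)  # mark as most recently used
            return self.entries[pde_id]
        data = self.fetch(pde_id)
        self.storage_fetches += 1
        if len(data) > self.budget:
            return data  # too large to retain; serve without caching
        while self.used + len(data) > self.budget:
            _, evicted = self.entries.popitem(last=False)  # evict least recent
            self.used -= len(evicted)
        self.entries[pde_id] = data
        self.used += len(data)
        return data


store = {"A": b"x" * 60, "B": b"y" * 60}
cache = BudgetedPdeCache(100, store.__getitem__)
for pde_id in ["A", "B", "A", "A"]:
    cache.get(pde_id)
print(cache.storage_fetches)  # 3
```

With a 100-byte budget the two 60-byte elements cannot both be resident, so the second request for "A" goes back to storage. That extra IO is the incremental cost to weigh against distilling across a larger scope.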
Note that
The distributed computing paradigm entails distributed processing of large datasets by programs running on multiple computers.
The Data Distillation Process can be executed in a distributed fashion across the multiple nodes of a distributed computing cluster to harness the total compute, memory, and storage capacity of the numerous computers in the cluster. In this setup, a master distillation module on the master node interacts with slave distillation modules running on slave nodes to achieve the data distillation in a distributed fashion. To facilitate this distribution, the Prime Data Sieve of the apparatus can be partitioned into multiple independent subsets or subtrees that can be distributed across multiple slave modules running on the slave nodes. Recall that in the Data Distillation Apparatus, the Prime Data Elements are organized in tree form based upon their Names, and their Names are derived from their content. The Prime Data Sieve can be partitioned into multiple independent subsets or Child Sieves based on the leading bytes of the Name of Elements in the Prime Data Sieve. There can be multiple ways to partition the Name space across multiple subtrees. For example, the values of the leading bytes of the Name of elements can be partitioned into a number of subranges, and each subrange assigned to a Child Sieve. There can be as many subsets or partitions created as there are slave modules in the cluster, so each independent partition is deployed on a particular slave module. Using the deployed Child Sieve, each slave module is designed to execute the data distillation process on candidate elements that it receives.
In this setup, the master module running on the master node receives an Input File, performs a lightweight parsing and factorization of the Input File to break it into a sequence of candidate elements, and subsequently steers each candidate element to a suitable slave module for further processing. The lightweight parsing might include parsing each candidate element against a schema, or might include the application of fingerprinting on the candidate element to determine the dimensions that constitute the leading bytes of the Name of the candidate element. The parsing at the master is limited to identifying only as many bytes as are sufficient to determine which slave module should receive the candidate element. Based upon the value in the leading bytes of the Name of the candidate element, the candidate is forwarded to the slave module at the slave node which holds the Child Sieve that corresponds to this specific value.
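The steering decision can be sketched for one simple partitioning. The sketch assumes the Name is available as a byte string, that only its first byte is examined, and that the 256 possible values are split into equal-width subranges, one per slave module; other partitionings of the Name space are equally possible, and the function name is hypothetical.

```python
def child_sieve_for(name: bytes, num_slaves: int) -> int:
    """Map the leading byte of a Name to the index of the slave holding its Child Sieve."""
    if not name:
        raise ValueError("Name must contain at least one byte")
    if not 1 <= num_slaves <= 256:
        raise ValueError("num_slaves must be between 1 and 256")
    # Equal-width subranges of the leading byte value 0..255.
    return name[0] * num_slaves // 256


# With 4 slave modules, leading bytes 0x00-0x3F go to slave 0, 0x40-0x7F to slave 1, etc.
print(child_sieve_for(bytes([0x3F, 0x10]), 4))  # 0
print(child_sieve_for(bytes([0x40, 0x10]), 4))  # 1
print(child_sieve_for(bytes([0xFF]), 4))        # 3
```

Because the mapping depends only on the leading byte, the master needs to parse no more of the candidate element than is required to produce that byte.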
As data accumulates into the Sieve, the partition can be intermittently revisited and rebalanced. The partitioning and rebalancing functions can be performed by the master module.
Upon receiving a candidate element, each slave module executes the Data Distillation process, starting with a complete parsing and examination of the candidate element to create its Name. Using this Name, the slave module performs a content associative lookup of the Child Sieve, and executes the distillation process to convert the candidate element into an Element in the losslessly reduced representation with respect to that Child Sieve. The losslessly reduced representation of an Element in the Distilled File is enhanced with a field called SlaveNumber to identify the slave module and corresponding Child Sieve with respect to which the Element has been reduced. The losslessly reduced representation of the Element is sent back to the master module. If the candidate element is not found in the Child Sieve or cannot be derived from Prime Data Elements in the Child Sieve, a fresh Prime Data Element is identified to be allocated into the Child Sieve.
The master module continues to steer all candidate elements from an Input File to appropriate slave modules and accumulates the incoming Element descriptions (in losslessly reduced representation) until it has received all Elements for the Input File. At that point a global commit communication can be issued to all slave modules to update their respective Child Sieves with the outcome of their individual distillation processes. The Distilled File for the input is stored at the master module.
In some embodiments, rather than wait for the entire Distilled File to be prepared before any slave can update its Child Sieve with either fresh Prime Data Elements or metadata, the updates to the Child Sieves may be completed as the candidate elements get processed at the slave modules.
In some embodiments, each Child Sieve contains Prime Data Elements as well as Reconstitution Programs in accordance with the descriptions for
Data retrieval is similarly co-ordinated by the master module. The master module receives a Distilled File and examines the losslessly reduced specification for each Element in the Distilled File. It extracts the field “SlaveNumber” that indicates which slave module will reconstitute the Element. The Element is then sent to the appropriate slave module for reconstitution. The Reconstituted Element is then sent back to the master module. The master module assembles Reconstituted Elements from all the slaves and forwards the Reconstituted file to the consumer that is demanding the file.
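The retrieval flow can be sketched with in-process stand-ins for the slave modules. Everything concrete here is an assumption made for illustration: each entry of the Distilled File is modelled as a (SlaveNumber, reduced element) pair, each Child Sieve as a dictionary, and the Reconstitution Program as a toy "append these bytes" operation rather than the programs described in this disclosure.

```python
def make_slave(child_sieve):
    """Build a toy slave module that reconstitutes against its own Child Sieve."""
    def reconstitute(reduced_element):
        pde_key, appended_bytes = reduced_element  # toy Reconstitution Program
        return child_sieve[pde_key] + appended_bytes
    return reconstitute


def reconstitute_file(distilled_file, slaves):
    """Master side: route each Element by SlaveNumber and assemble the results in order."""
    return b"".join(
        slaves[slave_number](reduced_element)
        for slave_number, reduced_element in distilled_file
    )


slaves = {
    0: make_slave({"p1": b"hello "}),
    1: make_slave({"p2": b"wor"}),
}
distilled_file = [(0, ("p1", b"")), (1, ("p2", b"ld"))]
print(reconstitute_file(distilled_file, slaves))  # b'hello world'
```

In a real cluster the calls to the slave modules would be remote and could be issued concurrently; the master would still assemble the Reconstituted Elements in their original order.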
Note that by employing the arrangements described in
To further boost the overall throughput, the task of lightweight parsing and factorization of the input stream to identify which Child Sieve should receive the candidate element can be parallelized. This task can be partitioned by the master module into multiple concurrent tasks to be executed in parallel by the slave modules running on the multiple slave nodes. This can be accomplished by looking ahead in the data stream and slicing the data stream into multiple partially overlapping segments. These segments are sent by the master to each of the slave modules which perform the lightweight parsing and factorization in parallel and send back the results of the factorization to the master. The master resolves the factorization across the boundaries of each of the segments and then routes the candidate elements to the appropriate slave module.
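The slicing step can be sketched as below. The sketch assumes fixed-size segments that each extend a fixed number of bytes into the following segment; the function name and the parameters `segment_size` and `overlap` are hypothetical, and how the master resolves elements that straddle a boundary is not shown.

```python
def overlapping_segments(stream: bytes, segment_size: int, overlap: int):
    """Slice a stream into segments that each run `overlap` bytes into the next one.

    Returns a list of (start_offset, segment_bytes) pairs.
    """
    if segment_size <= 0 or overlap < 0:
        raise ValueError("segment_size must be positive and overlap non-negative")
    return [
        (start, stream[start:start + segment_size + overlap])
        for start in range(0, len(stream), segment_size)
    ]


stream = bytes(range(10))
for start, segment in overlapping_segments(stream, segment_size=4, overlap=2):
    print(start, list(segment))
# 0 [0, 1, 2, 3, 4, 5]
# 4 [4, 5, 6, 7, 8, 9]
# 8 [8, 9]
```

Each slave module can factorize its segment independently. The overlap gives the master enough context to reconcile an element that begins near the end of one segment and finishes in the next.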
As shown in
In this arrangement, retrieval of data can be satisfied at any slave module. The module that receives the retrieval request needs to first determine where the Distilled File for that requested File resides, and fetch the Distilled File from the corresponding slave module. Subsequently, the initiating slave module needs to co-ordinate the distributed reconstitution of the various elements in that Distilled File to yield the original File and deliver it to the requesting application.
In this fashion, the Data Distillation Process can be executed in a distributed manner across multiple nodes of a distributed system to more effectively harness the total compute, memory, and storage capacity of the numerous computers in the cluster. All nodes in the system can be utilized to ingest and retrieve data. This should enable very high rates of data ingestion and retrieval while taking full advantage of the total combined storage capacity of the nodes in the system. This also allows applications running on any node in the system to make a query at a local node for any data stored anywhere in the system, and to have that query satisfied efficiently and seamlessly.
In the arrangements described in
Thus, employing the numerous techniques described above, an efficient use is made of the resources in the distributed system to perform data distillation on very large datasets at very high speeds.
The Data Distillation™ method and apparatus can be further enhanced to facilitate efficient movement and migration of data. In some embodiments, the losslessly reduced dataset can be delivered in the form of multiple containers or Parcels to facilitate data movement. In some embodiments, one or more reduced Data Lots could fit into a single container or Parcel, and alternatively, a single reduced Data Lot can be converted into multiple Parcels. In some embodiments, a single reduced Data Lot is delivered as a single self-describing Parcel.
In this manner, the losslessly reduced representation of the Data Lot is delivered as a Parcel in a format that is self-describing and that is suitable for the movement and relocation of the data.
Data reduction was performed on a variety of real world datasets using the embodiments described herein to determine the effectiveness of these embodiments. The real world datasets studied include the Enron Corpus of corporate email, various U.S. Government records and documents, U.S. Department of Transportation records entered into the MongoDB NoSQL database, and corporate PowerPoint presentations available to the public. Using the embodiments described herein, and factorizing the input data into variable-sized elements (with boundaries determined by fingerprinting) averaging 4 KB, an average data reduction of 3.23× was achieved across these datasets. A reduction of 3.23× implies that the size of the reduced data is equal to the size of the original data divided by 3.23, leading to a reduced footprint that is about 31% of the original size. Traditional data deduplication techniques were found to deliver a data reduction of 1.487× on these datasets using equivalent parameters. Using the embodiments described herein, and factorizing the input data into fixed-sized elements of 4 KB, an average data reduction of 1.86× was achieved across these datasets. Traditional data deduplication techniques were found to deliver a data reduction of 1.08× on these datasets using equivalent parameters. Hence, the Data Distillation™ solution was found to deliver significantly better data reduction than traditional data deduplication solutions.
The test runs also confirm that a small subset of the bytes of a Prime Data Element serves to order the majority of the elements in the Sieve, thus enabling a solution that requires minimal incremental storage for its operation.
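The relationship between a reduction ratio and the resulting footprint can be checked with a short calculation. The sketch below uses only the four ratios reported above; the footprint percentages are derived arithmetically from those ratios and are not additional measurements. The dictionary keys and the function name `footprint_percent` are labels chosen for this example.

```python
# Reduction ratios reported above (original size / reduced size).
reported = {
    "distillation, variable-sized elements": 3.23,
    "deduplication, variable-sized elements": 1.487,
    "distillation, fixed-sized elements": 1.86,
    "deduplication, fixed-sized elements": 1.08,
}


def footprint_percent(ratio: float) -> float:
    """Reduced size expressed as a percentage of the original size."""
    return 100.0 / ratio


footprints = {name: footprint_percent(r) for name, r in reported.items()}
for name, pct in footprints.items():
    print(f"{name}: {reported[name]}x -> {pct:.1f}% of original size")
```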
The results confirm that the Data Distillation™ apparatus efficiently enables exploiting redundancy among data elements globally across the entire dataset, at a grain that is finer than the element itself. The lossless data reduction delivered by this method is achieved with an economy of data accesses and IOs, employing data structures that themselves require minimal incremental storage, and using a fraction of the total computational processing power that is available on modern multicore microprocessors. Embodiments described in the preceding sections feature systems and techniques that perform lossless data reduction on large and extremely large datasets while providing high rates of data ingestion and data retrieval, and that do not suffer from the drawbacks and limitations of conventional techniques.
Performing Content Associative Search and Retrieval on Data that has Been Losslessly Reduced by Deriving Data from Prime Data Elements Resident in a Prime Data Sieve
The Data Distillation Apparatus described in the preceding text and illustrated in
The Reverse Reference or Reverse Link is designed as a universal handle which can reach all Elements in all Distilled Files that share the Prime Data Sieve.
The addition of the Reverse References is not expected to significantly impact the data reduction achieved, since data element size is expected to be chosen such that each reference is a fraction of the size of the data element. For example, consider a system where Derivative Elements are constrained to each derive off no more than 1 Prime Data Element (so multi-element derivatives are not allowed). The total number of Reverse References across all Leaf Node Data Structures will equal the total number of Elements across all Distilled Files. Assume the sample input dataset of 32 GB size is reduced to 8 GB of losslessly reduced data, employing an average element size of 1 KB, and yielding a reduction ratio of 4×. There are 32M elements in the input data. If each Reverse Reference is 8 B in size, the total space occupied by the Reverse References is 256 MB, or 0.25 GB. This is a small increase to the 8 GB footprint of the reduced data. The new footprint will be 8.25 GB and the effective reduction achieved will be 3.88×, which represents a loss of reduction of 3%. This is a small price to pay for the benefits of powerful content associative data retrieval on the reduced data.
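The arithmetic of this example can be reproduced with the following sketch. It assumes binary units (1 GB taken as 2^30 bytes and 1 KB as 2^10 bytes), which is an assumption of this illustration; the resulting ratios do not depend on that choice. The variable names are illustrative only.

```python
GB = 1 << 30
KB = 1 << 10

input_size = 32 * GB        # sample input dataset
reduced_size = 8 * GB       # losslessly reduced data before adding Reverse References
element_size = 1 * KB       # average element size
reverse_ref_size = 8        # bytes per Reverse Reference

# One Reverse Reference per Element, since each Derivative derives off
# no more than one Prime Data Element in this example.
num_elements = input_size // element_size
overhead = num_elements * reverse_ref_size
new_footprint = reduced_size + overhead

base_ratio = input_size / reduced_size     # 4.0
new_ratio = input_size / new_footprint     # about 3.88
loss = 1 - new_ratio / base_ratio          # about 0.03

print(f"elements: {num_elements}, overhead: {overhead / GB:.2f} GB")
print(f"reduction: {base_ratio:.2f}x -> {new_ratio:.2f}x (loss {loss:.1%})")
```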
As described earlier in this document, the Distillation Apparatus can employ a variety of methods to determine the locations of the various components of the Skeletal Data Structure within the content of a candidate element. The various components of the Skeletal Data Structure of the element can be considered as Dimensions, so that a concatenation of these Dimensions followed by the rest of the content of each element is used to create the Name of each element. The Name is used to order and organize the Prime Data Elements in the tree.
In usage models where the structure of the input data is known, a schema defines the various fields or Dimensions. Such a schema is furnished by the Analytics Application that is using this Content Associative Data Retrieval Apparatus and is provided to the apparatus through an interface to the application. Based upon the declarations in the schema, the Parser of the Distillation Apparatus is able to parse the content of a candidate element to detect and locate the various Dimensions and create the Name of the candidate element. As described earlier, Elements that have the same content in the fields corresponding to the Dimensions will be grouped together along the same leg of the tree. For each Prime Data Element installed into the Sieve, the information on the Dimensions can be stored as metadata in the entry for that Prime Data Element in the Leaf Node Data Structure. This information can include the locations, sizes, and values of content at each of the declared Dimensions and can be stored in the field referred to in
The schemas furnished by the application driving this apparatus may specify a number of Primary Dimensions as well as a number of Secondary Dimensions. Information for all of these Primary and Secondary Dimensions can be retained in the metadata in the Leaf Node Data Structure. The Primary Dimensions are used to form the principal axis along which to sort and organize the elements in the Sieve. If Primary Dimensions are exhausted and subtrees with large membership still remain, then Secondary Dimensions may also be used deeper down the tree to further subdivide the elements into smaller groups. Information on the Secondary Dimensions can be retained as metadata and also used as secondary criteria to differentiate the elements within a leaf node. In some embodiments offering content associative multidimensional search and retrieval, a requirement may be placed that all incoming data must contain the keys and valid values for each of the Dimensions declared by the schema. This allows the system a way to ensure that only valid data enters the desired subtrees in the Sieve. Candidate elements which either do not contain all fields specified as Dimensions or which contain invalid values in the values corresponding to the fields for the Dimensions will be sent down a different subtree as illustrated earlier in
The Data Distillation apparatus is constrained in one additional way in order to comprehensively support content associative search and retrieval of data based upon the content in the Dimensions. When Derivative Elements are created off a Prime Data Element, the Deriver is constrained to ensure that both the Prime Data Element and the Derivative have the exact same content in the value fields for each of the corresponding Dimensions. Thus, when a derivative is being created, the Reconstitution Program is not allowed to perturb or modify the content in the value fields corresponding to any of the Dimensions of the Prime Data Element, in order to construct the Derivative Element. Given a candidate element, during lookup of the Sieve, if the candidate element has different content in any of the Dimensions compared to the corresponding Dimensions of the target Prime Data Element, a fresh Prime Data Element needs to be installed, rather than accept the derivative. For example, if a subset of the Primary Dimensions sufficiently sort the elements into distinct groups in the tree so that a candidate element arrives at a leaf node to find a Prime Data Element that has the same content in this subset of Primary Dimensions but different content in either the remaining Primary Dimensions or the Secondary Dimensions, then, instead of creating a derivative, a fresh Prime Data Element needs to be installed. This feature ensures that all data can be searched using the Dimensions by simply querying the Prime Data Sieve.
The Deriver may employ a variety of implementation techniques to enforce the constraint that the Candidate Element and the Prime Data Element must have the exact same content in the value fields for each of the corresponding Dimensions. The Deriver may extract information comprising the locations, lengths and content of the fields corresponding to the Dimensions from the Skeletal Data Structure of the Prime Data Element. Similarly, this information is received from the Parser/Factorizer or computed for the Candidate Element. Next the corresponding fields for the Dimensions from the candidate Element and the Prime Data Element can be compared for equality. Once confirmed to be equal, the Deriver may proceed with the rest of the Derivation. If there is no equality, the Candidate Element is installed in the Sieve as a fresh Prime Data Element.
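One possible form of this equality check is sketched below. It assumes that the locations and lengths of the Dimension fields are already known for both elements (for example from the Skeletal Data Structure and from the Parser/Factorizer); the names `DimensionField`, `dimension_values`, and `may_derive` are hypothetical and are not taken from the description.

```python
from typing import List, NamedTuple


class DimensionField(NamedTuple):
    offset: int  # location of the field within the element
    length: int  # length of the field in bytes


def dimension_values(element: bytes, fields: List[DimensionField]) -> List[bytes]:
    """Extract the content of each Dimension field from an element."""
    return [element[f.offset:f.offset + f.length] for f in fields]


def may_derive(candidate: bytes, candidate_fields: List[DimensionField],
               prime: bytes, prime_fields: List[DimensionField]) -> bool:
    """True only if every Dimension has identical content in both elements."""
    return (dimension_values(candidate, candidate_fields)
            == dimension_values(prime, prime_fields))


fields = [DimensionField(0, 4), DimensionField(8, 4)]
prime = b"AAAA....BBBBxyz"
same_dims = b"AAAA,,,,BBBBxyq"   # differs only outside the Dimension fields
diff_dims = b"AAAC....BBBBxyz"   # differs inside the first Dimension field

ok = may_derive(same_dims, fields, prime, fields)        # derivation may proceed
rejected = may_derive(diff_dims, fields, prime, fields)  # install fresh Prime Data Element
```

The field lists are passed separately for the candidate and the Prime Data Element because the same Dimension need not sit at the same offset in both elements.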
The restrictions described above are not expected to significantly hamper the degree of data reduction for most usage models. For example, if input data is comprised of a set of Elements which are data warehouse transactions of size 1000 bytes each, and if a set of 6 Primary Dimensions and 14 Secondary Dimensions are specified by the schema, each with say 8 bytes of data per Dimension, the total bytes occupied by content at the Dimensions is 160 bytes. No perturbations are allowed on these 160 bytes when creating a derivative. This would still leave the remaining 840 bytes of candidate element data available for perturbation to create derivatives, thus leaving ample opportunity for exploitation of redundancy, while simultaneously enabling the data from the data warehouse to be searched and retrieved in a content associative manner using the Dimensions.
To execute a search query for data containing specific values for fields in the Dimensions, the apparatus can traverse the tree and reach a node in the tree that matches the Dimensions specified, and all Leaf Node Data structures below that node can be returned as the result of the lookup. References to Prime Data Elements present at the Leaf Node can be used to fetch the desired Prime Data Elements if required. The Reverse Links enable retrieval of the input Element (in losslessly reduced format) from the Distilled File, if so desired. The Element can subsequently be reconstituted to yield the original input data. Thus, the enhanced apparatus allows all the searching to be done on data in the Prime Data Sieve (which is a smaller subset of the total data) while yet being able to reach and retrieve all derivative elements as needed.
The apparatus as enhanced can be used to execute search and lookup Queries for powerful searches and retrieval of relevant subsets of data based upon the content in Dimensions specified by the query. A Content Associative Data Retrieval Query will have the form “Fetch (Dimension 1, Value of Dimension 1; Dimension 2, Value of Dimension 2; . . . )”. The Query will specify the Dimensions involved in the search as well as the values to be used for each of the specified Dimensions for content associative search and lookup. A query may specify all the Dimensions or it may specify only a subset of the Dimensions. The Queries may specify compound conditions based on multiple dimensions as the criteria for the search and retrieval. All data in the Sieve which has the specified values for the specified Dimensions will be retrieved.
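The matching semantics of such a query, in which a subset of the Dimensions may be specified and all entries with the specified values are returned, can be illustrated with the sketch below. It is a simplification that scans a flat list of Dimension metadata rather than navigating the tree of the Sieve; the `fetch` function, the entry layout, and the sample Dimension names (`state`, `year`) are hypothetical.

```python
from typing import Dict, List

# Each entry models the Dimension metadata kept for one Prime Data Element.
LeafEntry = Dict[str, str]


def fetch(entries: List[LeafEntry], **criteria: str) -> List[LeafEntry]:
    """Return the entries whose Dimensions equal every (dimension, value) pair given."""
    return [entry for entry in entries
            if all(entry.get(dim) == value for dim, value in criteria.items())]


entries = [
    {"ref": "pde-1", "state": "CA", "year": "2015"},
    {"ref": "pde-2", "state": "NY", "year": "2015"},
    {"ref": "pde-3", "state": "CA", "year": "2016"},
]

by_state = fetch(entries, state="CA")                    # subset of the Dimensions
by_state_year = fetch(entries, state="CA", year="2015")  # compound condition
```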
A variety of Fetch queries can be supported and made available to the Analytics Application that is using this Content Associative Data Retrieval Apparatus. Such queries will be furnished to the apparatus through an interface from the application. The interface provides queries from the application to the apparatus and returns results of queries from the apparatus to the application. Firstly, a query FetchRefs can be used to fetch a reference or Handle to the Leaf Node Data Structure in
In addition to such multidimensional content associative Fetch primitives, the interface may also provide to the application the capability to directly access Prime Data Elements (using the Reference to the Prime Data Element) and Elements in the Distilled File (using the Reverse Reference to the Element). Additionally, the interface may provide to the application the capability to Reconstitute a Distilled Element in the Distilled File (given a Reference to the Distilled Element) and deliver the Element as it existed in the Input Data.
A judicious combination of these queries can be used by an Analytics application to perform searches, determine relevant unions and intersections, and glean important insights.
Alternatively, such an auxiliary tree may be based on Secondary Dimensions and used to aid in rapid convergence of searches using the Dimensions.
Examples of Queries executed on the apparatus shown in
The apparatus described thus far can be used to support searches based upon content that is specified in fields called Dimensions. Additionally, the apparatus can be used to support searches based upon a listing of keywords that are not included in the listing of Dimensions. Such keywords may be provided to the apparatus by an application such as a search engine that is driving the apparatus. The keywords may be specified to the apparatus via a schema declaration or passed in via a keyword list containing all the keywords, each separated by a declared separator (such as spaces, or commas, or linefeeds). Alternatively, both a schema as well as a keyword list may be used to collectively specify all the keywords. A very large number of keywords may be specified—the apparatus places no limit on the number of keywords. These search keywords will be referred to as Keywords. The apparatus can maintain an inverted index for search using these Keywords. The inverted index contains for each Keyword a listing of Reverse References to Elements in the Distilled Files that contain this Keyword.
Based upon the Keyword declarations in the schema or the Keyword list, the Parser of the Distillation Apparatus can parse the content of a candidate element to detect and locate the various Keywords (if and where found) in the incoming candidate element. Subsequently, the candidate element is converted into either a Prime Data Element or Derivative Element by the Data Distillation Apparatus and placed as an Element in the Distilled File. For each Keyword found in the Element, the inverted index is then updated to include a Reverse Reference to this Element in the Distilled File. Recall that Elements in the Distilled File are in the losslessly reduced representation.
Upon a search Query of the data using a Keyword, the inverted index is consulted to find and extract Reverse References to Elements in the Distilled File that contain this Keyword. Using the Reverse Reference to such an Element, the losslessly reduced representation of the Element can be retrieved, and the Element can be reconstituted. The Reconstituted Element can then be provided as the result of the search Query.
The inverted index can be enhanced to contain information which locates the offset of the Keyword in the Reconstituted Element. Note that the offset or location of each Keyword detected in the candidate element can be determined by the Parser and hence this information can also be recorded in the inverted index when the Reverse Reference to the Element in the Distilled File is placed into the inverted index. Upon a search Query, after the inverted index is consulted to retrieve a Reverse Reference to an Element in the Distilled File that contains the relevant Keyword, and after the Element is reconstituted, the recorded offset or location of the Keyword in the Reconstituted Element (same as the original input candidate element) can be used to pinpoint where the Keyword exists in the Input data or Input File.
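A minimal sketch of an inverted index that records, for each Keyword, the Reverse Reference to the containing Element together with the offset of the Keyword is given below. It models a Reverse Reference as a pair of a Distilled File identifier and an Element index, and it locates Keywords by plain substring search; both are assumptions of this example, as are the names `KeywordIndex`, `index_element`, and `search`.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical model of a Reverse Reference: (distilled_file_id, element_index).
ReverseRef = Tuple[str, int]
Posting = Tuple[ReverseRef, int]  # (Reverse Reference, offset of Keyword in Element)


class KeywordIndex:
    def __init__(self, keywords: Iterable[str]) -> None:
        self.keywords = [k.encode() for k in keywords]
        self.postings: Dict[bytes, List[Posting]] = defaultdict(list)

    def index_element(self, ref: ReverseRef, content: bytes) -> None:
        """Record every occurrence of every Keyword found in the candidate element."""
        for keyword in self.keywords:
            pos = content.find(keyword)
            while pos != -1:
                self.postings[keyword].append((ref, pos))
                pos = content.find(keyword, pos + 1)

    def search(self, keyword: str) -> List[Posting]:
        """Return (Reverse Reference, offset) pairs for a Keyword."""
        return list(self.postings.get(keyword.encode(), []))


index = KeywordIndex(["error", "disk"])
index.index_element(("fileA", 0), b"disk error on disk 2")
index.index_element(("fileA", 1), b"all good")

disk_hits = index.search("disk")
error_hits = index.search("error")
```

The offsets are taken from the candidate element before it is reduced, which matches the observation above that the Reconstituted Element is the same as the original input candidate element.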
Dimensions and Keywords have different implications for the Prime Data Sieve in the Data Distillation Apparatus. Note that the Dimensions are used as the principal axes along which to organize Prime Data Elements in the Sieve. Dimensions form the Skeletal Data Structure of each Element in the data. The Dimensions are declared based upon knowledge of the structure of the incoming data. The Deriver is constrained such that any Derivative Element that is created must have the exact same content as the Prime Data Element in the values of the fields for each of the corresponding Dimensions.
These properties need not hold for the Keywords. In some embodiments, there is no a priori requirement that the Keywords even exist in the data, the Prime Data Sieve is not required to be organized based on the Keywords, and the Deriver is not constrained with regard to derivations involving content containing the Keywords. The Deriver is free to create a derivative from a Prime Data Element by modifying the values of Keywords if necessary. The locations of the Keywords are simply recorded where found upon scanning the input data, and the inverted index is updated. Upon a content associative search based on the Keywords, the inverted index is queried and all locations of the Keywords are obtained.
In other embodiments, the Keywords are not required to exist in the data (the absence of Keywords in the data does not invalidate the data), but the Prime Data Sieve is required to contain all Elements that contain Keywords, and the Deriver is constrained with regard to derivations involving content containing the Keywords: no derivations are allowed other than reducing duplicates. The purpose of these embodiments is that all distinct Elements containing any Keyword must exist in the Prime Data Sieve. This is an example where the rules governing the selection of Prime Data are conditioned by the Keywords. In these embodiments, a modified inverted index may be created which contains, for each Keyword, a Reverse Reference to each Prime Data Element containing the Keyword. In these embodiments, powerful Keyword-based search capability is realized, wherein searching only the Prime Data Sieve is as effective as searching the entire data.
Other embodiments may exist where the Deriver is constrained so that the Reconstitution Program is not allowed to perturb or modify the contents of any Keyword found in the Prime Data Element, in order to formulate a Candidate Element as a Derivative Element of that Prime Data Element. The Keyword needs to propagate unchanged from the Prime Data Element to the Derivative. If the Deriver needs to modify bytes of any Keyword found in the Prime Data Element in order to successfully formulate the candidate as a derivative of this Prime Data Element, the Derivative may not be accepted, and the candidate must be installed as a fresh Prime Data Element in the Sieve.
The Deriver may be constrained in a variety of ways with regards to derivations involving the Keywords so that the rules governing the selection of Prime Data are conditioned by the Keywords.
The apparatus for Search of data using Keywords can accept updates to the listing of Keywords. Keywords can be added without any changes to the data that is stored in losslessly reduced form. When new Keywords are added, fresh incoming data can be parsed against the updated Keyword list, and the inverted index updated with the incoming data subsequently being stored in losslessly reduced form. If the existing data (that is already stored in losslessly reduced form) needs to be indexed against the new Keywords, the apparatus can progressively read in the Distilled Files (either one or more Distilled Files at a time, or one Losslessly Reduced Data Lot at a time), reconstitute the original files (but without disturbing the losslessly reduced stored data), and parse the reconstituted files to update the inverted index. All this while, the entire data repository can continue to remain stored in losslessly reduced form.
In summary, the enhanced Data Distillation apparatus enables powerful multidimensional content associative search and retrieval on data that is stored in losslessly reduced form.
The Data Distillation™ apparatus can be employed for the purposes of lossless reduction of audio and video data. The data reduction accomplished by the method is achieved by deriving components of the audio and video data from prime data elements resident in a content associative sieve. Applications of the method for such purposes will now be described.
In MP3, the incoming audio stream is compressed into a sequence of several small data frames, each containing a frame header and compressed audio data. The original audio stream is periodically sampled to produce a sequence of snippets of audio which are then compressed employing Perceptual Coding and Huffman Coding to produce a sequence of MP3 data frames. Both the Perceptual Coding and Huffman Coding techniques are applied locally within each snippet of the audio data. The Huffman Coding technique exploits redundancy locally within a snippet of audio but not globally across the audio stream. Thus the MP3 techniques do not exploit redundancy globally—neither across a single audio stream, nor between multiple audio streams. This represents an opportunity for further data reduction beyond what MP3 can achieve.
Each MP3 data frame represents an audio snippet of 26 ms. Each frame stores 1152 samples and is subdivided into two granules each containing 576 samples. As can be seen in the Encoder Block Diagram in
In
The above described method enables the exploitation of redundancy at a global scope, across multiple Audio Granules stored in the apparatus. MP3 Encoded Data files may be transformed into Distilled MP3 Data and stored in losslessly reduced form. When needed to be retrieved, the data retrieval process (employing Retriever 1871 and Reconstitutor 1872) can be invoked to reconstitute the MP3 Encoded Data 1873. In the apparatus shown in
In this fashion, the Data Distillation Apparatus may be adapted and employed to further reduce the size of MP3 audio files.
In another variation of the scheme described, upon receiving an MP3 Encoded Stream, the Parser/Factorizer takes the entire Main Data field as a Candidate Element for derivation or as a Prime Data Element for installation into the Prime Data Sieve. In this variation, all Elements will continue to remain Huffman Coded, and Reconstitution Programs will operate upon Elements that are already Huffman Coded. This variation of the Data Distillation Apparatus may also be employed to further reduce the size of MP3 audio files.
In a manner similar to that described in the preceding sections and illustrated in
The above described method enables the exploitation of redundancy at a global scope, across multiple I-frames of multiple video data sets stored in the apparatus. When needed to be retrieved, the data retrieval process (employing Retriever 1911 and Reconstitutor 1912) can be invoked to reconstitute the Video Data 1913. In the apparatus shown in
In this fashion, the Data Distillation Apparatus may be adapted and employed to further reduce the size of video files.
The above description is presented to enable any person skilled in the art to make and use the embodiments. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein are applicable to other embodiments and applications without departing from the spirit and scope of the present disclosure. Thus, the present invention is not limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.
The data structures and code described in this disclosure can be partially or fully stored on a computer-readable storage medium and/or a hardware module and/or hardware apparatus. A computer-readable storage medium includes, but is not limited to, volatile memory, non-volatile memory, magnetic and optical storage devices such as disk drives, magnetic tape, CDs (compact discs), DVDs (digital versatile discs or digital video discs), or other media, now known or later developed, that are capable of storing code and/or data. Hardware modules or apparatuses described in this disclosure include, but are not limited to, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), dedicated or shared processors, and/or other hardware modules or apparatuses now known or later developed.
The methods and processes described in this disclosure can be partially or fully embodied as code and/or data stored in a computer-readable storage medium or device, so that when a computer system reads and executes the code and/or data, the computer system performs the associated methods and processes. The methods and processes can also be partially or fully embodied in hardware modules or apparatuses, so that when the hardware modules or apparatuses are activated, they perform the associated methods and processes. Note that the methods and processes can be embodied using a combination of code, data, and hardware modules or apparatuses.
The foregoing descriptions of embodiments of the present invention have been presented only for purposes of illustration and description. They are not intended to be exhaustive or to limit the present invention to the forms disclosed. Accordingly, many modifications and variations will be apparent to practitioners skilled in the art. Additionally, the above disclosure is not intended to limit the present invention.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2021/031505 | 5/10/2021 | WO |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2021/231255 | 11/18/2021 | WO | A |
Number | Date | Country | |
---|---|---|---|
20230198549 A1 | Jun 2023 | US |
Number | Date | Country | |
---|---|---|---|
63022945 | May 2020 | US |