1. Field of the Invention
The present invention relates to the area of lossless data compression. More specifically, the present invention relates to a method for detecting and separating data blocks with stationary informational characteristics as a preliminary stage for lossless data compression.
2. Background Discussion
Lossless compression represents source information compactly, in a form from which the original data can be reconstructed exactly. Lossless compression methods are used in many areas, principally in data storage and data transmission, and improved coding techniques are continually sought to reduce the amount of memory required to store data and/or to increase the amount of data that can be transmitted over a communications channel in a given amount of time. As a general rule, lossless compression methods are more efficient when applied to the particular types of data they are specifically designed to compress. Furthermore, many compression methods are sensitive to changes in data characteristics. Detecting and separating data blocks of a particular type with stationary informational characteristics (solid block splitting) is therefore an important preliminary stage in many compression technologies, especially those operating on independent blocks (such as Burrows-Wheeler compression).
There have been several efforts to implement an intelligent data block detection mechanism. However, while data type detection is more or less successful in most cases, existing solid block splitting solutions are less than satisfactory.
A widely used and straightforward approach to solid block splitting works as follows: Small portions, or blocks (b), of data within an input data set are consecutively analyzed, and decisions are made on whether they should be added to a solid block (B) currently being formed. If not, a rejected block is considered to be the beginning of the next solid block to be formed. The decision is usually based on a comparison function F(B, b) and a threshold δ:
$$F(B, b) \le \delta \;\Rightarrow\; B = B + b.$$
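The greedy procedure described above can be sketched in a few lines of Python. This is an illustrative sketch only: the names `split_solid_blocks` and `toy_compare`, the fixed `block_size`, and the choice of a mean-byte-value difference as a stand-in for F(B, b) are assumptions made for the example and are not taken from the source.

```python
from typing import Callable, List


def split_solid_blocks(
    data: bytes,
    block_size: int,
    compare: Callable[[bytes, bytes], float],
    delta: float,
) -> List[bytes]:
    """Greedy solid block splitting: append b to B while compare(B, b) <= delta."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    solid_blocks: List[bytes] = []
    current = b""  # the solid block B currently being formed
    for start in range(0, len(data), block_size):
        small = data[start:start + block_size]  # the next small block b
        if not current or compare(current, small) <= delta:
            current += small  # B = B + b
        else:
            solid_blocks.append(current)  # close B
            current = small  # the rejected b starts the next solid block
    if current:
        solid_blocks.append(current)
    return solid_blocks


def toy_compare(big: bytes, small: bytes) -> float:
    """Hypothetical stand-in for F(B, b): difference of mean byte values."""
    return abs(sum(big) / len(big) - sum(small) / len(small))


if __name__ == "__main__":
    sample = bytes([0] * 8 + [200] * 8)
    for block in split_solid_blocks(sample, 4, toy_compare, 10.0):
        print(len(block), block[:1])
```

Any comparison function with the same signature can be passed as `compare`; the loop itself does not depend on how F(B, b) is computed.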
The following terms have the definitions set out below.
$N$: alphabet size;
$X \in \{B, b\}$: a block;
$X_{\Sigma}$: the number of symbols in block $X$;
$X_{sym}$: the number of times symbol $sym$ appears in block $X$;
$S$: the number of different possible statistical states (statistical states allow the use of comprehensive context-based estimations);
$X^{st}$: the number of times statistical state $st$ appears in block $X$;
$X_{sym}^{st}$: the number of times symbol $sym$ appears in statistical state $st$ within block $X$;
$p_{sym}^{st}\left(\{X_{j}^{i}\}_{i \in \{1,\dots,S\},\, j \in \{1,\dots,N\}}\right)$: estimated probability of symbol $sym$ appearing in state $st$ within block $X$;
$p^{st}\left(\{X_{j}^{i}\}_{i \in \{1,\dots,S\},\, j \in \{1,\dots,N\}}\right)$: estimated probability of state $st$ appearing within block $X$.
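The following formulas describe evident relations between these quantities, each total being the sum of its parts:
$$X_{\Sigma} = \sum_{sym=1}^{N} X_{sym} = \sum_{st=1}^{S} X^{st}, \qquad X_{sym} = \sum_{st=1}^{S} X_{sym}^{st}, \qquad X^{st} = \sum_{sym=1}^{N} X_{sym}^{st}.$$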
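As an illustration of the counts defined above, the following Python sketch collects them in a single pass over a block and checks that each total equals the sum of its parts. The source does not say how a statistical state is derived; the sketch assumes, purely for the example, that the state of a symbol is the preceding byte (an order-1 context, with state 0 for the first symbol). The function name `collect_counts` is likewise hypothetical.

```python
from collections import Counter
from typing import Tuple


def collect_counts(block: bytes) -> Tuple[int, Counter, Counter, Counter]:
    """Return (X_sigma, X_sym, X_st, X_sym_st) for one block.

    Assumption: the statistical state of a symbol is the preceding byte
    (0 for the first symbol), i.e. an order-1 context.
    """
    sym_counts: Counter = Counter()    # X_sym
    state_counts: Counter = Counter()  # X^st
    joint_counts: Counter = Counter()  # X_sym^st, keyed by (sym, st)
    state = 0
    for sym in block:
        sym_counts[sym] += 1
        state_counts[state] += 1
        joint_counts[(sym, state)] += 1
        state = sym  # the next symbol is seen in the state named by this one
    return len(block), sym_counts, state_counts, joint_counts


if __name__ == "__main__":
    total, by_sym, by_state, joint = collect_counts(b"abracadabra")
    # Each total is the sum of its parts.
    assert total == sum(by_sym.values()) == sum(by_state.values())
    for sym, n in by_sym.items():
        assert n == sum(c for (s, _), c in joint.items() if s == sym)
    for st, n in by_state.items():
        assert n == sum(c for (_, t), c in joint.items() if t == st)
    print(total, len(by_sym), len(by_state))
```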
In most cases, probability estimations are calculated using the following two formulas:
Collected statistics may be insufficient if blocks are small but the number of states is large. Therefore, it may be impossible to estimate probabilities reliably. High computational complexity is also a problem, because recalculating probabilities for every block may be unacceptable. With increasing block size, the complexity problem may diminish and probability estimation becomes more precise, but block splitting efficiency still cannot be guaranteed as relatively large blocks cannot precisely separate small areas of data with stationary informational characteristics.
In practical data compression technologies, a state-based approach is rarely applied. In many cases it is assumed that S=1, and calculations are based on a simplified formula:
Such simplification leads to a performance tradeoff: it reduces computational complexity but may negatively affect the precision of block splitting.
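A minimal sketch of state-independent estimation (S = 1) follows. It assumes that the probability of a symbol is estimated as its plain relative frequency in the block, which is a common choice but is an assumption here rather than a formula taken from the source; the function names are illustrative.

```python
import math
from collections import Counter
from typing import Dict


def order0_probabilities(block: bytes) -> Dict[int, float]:
    """Relative-frequency estimate of each symbol, ignoring states (S = 1)."""
    counts = Counter(block)  # X_sym for every symbol present in the block
    total = len(block)       # X_sigma
    return {sym: count / total for sym, count in counts.items()}


def empirical_entropy(block: bytes) -> float:
    """Empirical entropy of the block in bits per symbol under the estimate above."""
    return -sum(p * math.log2(p) for p in order0_probabilities(block).values())


if __name__ == "__main__":
    print(order0_probabilities(b"aabb"))
    print(empirical_entropy(b"aabb"))
```

With only one table of at most N counters per block, the estimate is cheap to maintain, which reflects the reduction in computational complexity noted above.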
The following two comparison functions are usually employed in practice:
The first function is a comparison of empirical entropies of blocks B and b. If entropies are close, blocks B and b are considered to be parts of one solid block. The second function is a comparison of two different estimations of the size of the compressed representation of blocks B and b. One estimation (in the numerator) assumes that blocks are compressed together, while the other estimation (in the denominator) assumes that blocks are compressed separately. The first function is easier to calculate, but the result of the comparison does not guarantee efficient block separation in terms of final compression efficiency. The second function, while requiring more computational resources, is more suitable for practical compression applications. Nevertheless, because of inaccurate probability estimation for small blocks b, splitting may give unpredictable results.
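The two kinds of comparison described in words above can be sketched as follows. The exact formulas are not reproduced in this text, so the sketch makes explicit assumptions: S = 1, relative-frequency probability estimates, the absolute entropy difference as the first function, and the ratio of the joint size estimate to the sum of the separate size estimates as the second. All function names are hypothetical.

```python
import math
from collections import Counter


def entropy_bits(block: bytes) -> float:
    """Empirical order-0 entropy in bits per symbol (assumes S = 1)."""
    total = len(block)
    return -sum(c / total * math.log2(c / total) for c in Counter(block).values())


def entropy_difference(big: bytes, small: bytes) -> float:
    """First kind of comparison: how far apart the empirical entropies are."""
    return abs(entropy_bits(big) - entropy_bits(small))


def size_ratio(big: bytes, small: bytes) -> float:
    """Second kind: estimated size if coded together over size if coded separately."""
    merged = big + small
    together = len(merged) * entropy_bits(merged)
    separately = len(big) * entropy_bits(big) + len(small) * entropy_bits(small)
    if separately > 0:
        return together / separately
    # Both blocks are estimated at zero bits on their own.
    return math.inf if together > 0 else 1.0


if __name__ == "__main__":
    print(entropy_difference(b"aaaa", b"bbbb"), size_ratio(b"aaaa", b"bbbb"))
    print(entropy_difference(b"abab", b"abab"), size_ratio(b"abab", b"abab"))
```

For the blocks `b"aaaa"` and `b"bbbb"` the entropy difference is zero even though the blocks share no symbols, while the size ratio is unbounded. This small case illustrates why closeness of entropies alone does not guarantee efficient block separation.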
The present invention is a method of separating data blocks with stationary informational characteristics. The invention employs a new comparison function, providing improved overall compression efficiency. The low complexity of the inventive method makes it more suitable for practical applications.
Certain portions of the detailed description set out below employ algorithms, arithmetic, or other symbolic representations of operations performed on data stored within a computing system. This nomenclature is commonly used by those skilled in the art to communicate the substance of their work to others similarly skilled and knowledgeable. The operations discussed are performed on electrical and/or magnetic signals stored, or capable of being stored, as bits, data, values, characters, elements, symbols, terms, numbers, and the like within computer system processors, memory, registers, or other information storage, transmission, or display devices. The actions or processes involve the transformation of physical electronic and/or magnetic quantities within such storage, transmission, or display devices.
The present invention follows the approach described in the Background Discussion, above, but uses a new comparison function, set out as follows:
This function calculates the relative change in the estimated compression efficiency for a solid block B caused by merging its statistics with the statistics of a block b and using the merged statistics (rather than the original statistics of the block B) for probability estimation.
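One way to read the description above is sketched below. The sketch is an interpretation, not the inventive formula itself: it assumes S = 1, relative-frequency probability estimates, and an ideal coding cost of minus log2 of the estimated probability per symbol. The names `coding_cost_bits` and `merge_penalty` are hypothetical.

```python
import math
from collections import Counter


def coding_cost_bits(counts: Counter, model: Counter) -> float:
    """Estimated bits needed to code the symbols in `counts` using the
    relative frequencies of `model` as probability estimates."""
    model_total = sum(model.values())
    return -sum(n * math.log2(model[sym] / model_total) for sym, n in counts.items())


def merge_penalty(big: bytes, small: bytes) -> float:
    """Relative change in the estimated coded size of B when the merged
    statistics of B and b replace B's own statistics (assumed reading)."""
    own = Counter(big)
    merged = own + Counter(small)  # contains every symbol of B, so no zero estimates
    cost_own = coding_cost_bits(own, own)
    cost_merged = coding_cost_bits(own, merged)
    if cost_own == 0:
        return math.inf if cost_merged > 0 else 0.0
    return (cost_merged - cost_own) / cost_own


if __name__ == "__main__":
    print(merge_penalty(b"abab", b"abab"))
    print(merge_penalty(b"abab", b"cccc"))
```

Under these assumptions the penalty is never negative, because a block is coded most cheaply with its own relative frequencies. Only the symbols of B are iterated, so the cost of evaluating the function depends on the statistics of B rather than on a separate model fitted to the small block b.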
The comparison function can be significantly simplified:
The computational complexity of this improved comparison function is significantly lower than the computational complexity of previously used functions. Practical implementation has shown that the inventive method outperforms all previous methods in terms of efficiency. This can be explained partly by the fact that the precision of the new comparison function does not significantly depend on the size of the block b.
As a further improvement, a new probability estimation function is employed:
This combination of state-based and state-independent approaches avoids the problems typical of both approaches when they are applied separately. The result is noticeably improved block splitting efficiency. Moreover, the inventive probability estimation technique does not incur a significant performance cost compared to standard probability estimations.
Another improvement comes from combining block splitting and data type detection procedures. When a new block b has been detected and the detection process has indicated that blocks B and b are of different types, further comparison of these blocks is not necessary and the block b is considered to be the beginning of the next solid block. This not only improves compression efficiency but also eliminates unnecessary calculations.
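The short-circuit described above can be sketched as a small decision function. The type detector shown is a deliberately crude placeholder (printable ASCII versus everything else) invented for the example; the source does not specify how data types are detected, and all names here are hypothetical.

```python
from typing import Callable


def belongs_to_current_block(
    big: bytes,
    small: bytes,
    detect_type: Callable[[bytes], str],
    compare: Callable[[bytes, bytes], float],
    delta: float,
) -> bool:
    """Decide whether block b should be appended to solid block B."""
    if detect_type(big) != detect_type(small):
        # Different data types: b starts the next solid block,
        # and the comparison function is never evaluated.
        return False
    return compare(big, small) <= delta


def toy_detect_type(block: bytes) -> str:
    """Hypothetical detector: 'text' if every byte is printable ASCII
    or common whitespace, otherwise 'binary'."""
    is_text = all(32 <= value < 127 or value in (9, 10, 13) for value in block)
    return "text" if is_text else "binary"


if __name__ == "__main__":
    always_close = lambda big, small: 0.0  # placeholder for F(B, b)
    print(belongs_to_current_block(b"hello", b"\x00\x01", toy_detect_type, always_close, 0.5))
    print(belongs_to_current_block(b"hello", b"world", toy_detect_type, always_close, 0.5))
```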
From the foregoing, and by reference to
Number | Date | Country
---|---|---
61406070 | Oct 2010 | US |