The present invention relates to methods and systems for compressing electronic data for storage or transmission; and in particular to an improved method and system for lossless data compression utilizing multiple encoding tables.
The amount of data generated, collected and saved by businesses is increasing at an unprecedented rate. Businesses are retaining enormous amounts of detailed data, such as call detail records, transaction history, and web clickstreams, and then mining that data to identify business value. Regulatory and legal retention requirements oblige businesses to maintain years of accessible historical data.
As businesses enter an era of petabyte-scale data warehouses, advanced technologies such as data compression are increasingly utilized to maintain enormous data volumes in the warehouse effectively. Data compression reduces storage cost by storing more logical data per unit of physical capacity. Performance is also improved because there is less physical data to retrieve during database queries.
Character data, comprising alphanumeric data or text, must be encoded into bytes when used on a computer system. The amount of storage required for this type of data depends crucially on the encoding scheme utilized. Uncompressed character data stored within a typical database system requires one byte per character for most character data, and two or more bytes per character for East Asian characters. As greater amounts of data are saved within database systems, the need for compressing data, including character data, becomes increasingly vital. A more storage-efficient way of encoding character data is needed.
Data storage, retrieval, and manipulation must be performed expeditiously in order to satisfy users' demands. Compressing, and particularly decompressing, data must be performed with negligible effects on database and data transmission operations. Additionally, it is advantageous if the data can be manipulated in its compressed form.
Many compression schemes require a significant amount of storage overhead to provide compression, so an advantage is only realized when the strings being compressed are long. It is beneficial if a compression scheme significantly compresses short strings as well as long strings.
Described below is a character data compression scheme that provides a more storage efficient process for compressing character data, provides fast compression and decompression of data, produces a compressed data format which can be manipulated in the compressed form, and can be utilized to compress short strings as well as long strings.
In the following description, reference is made to the accompanying drawings that form a part hereof, and in which are shown by way of illustration specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable one of ordinary skill in the art to practice the invention, and it is to be understood that other embodiments may be utilized and that structural, logical, optical, and electrical changes may be made without departing from the scope of the present invention. The following description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined by the appended claims.
The storage device 102, in some embodiments, is a hard disk resident on a computing device and can be accessed by the database management system 104. In some embodiments, the database management system 104 and the storage device are located on the same computing device, such as a server computer. In other embodiments, the storage device 102 includes one or more computing devices such as a storage server, a storage area network, or another suitable device.
The client 116, in some embodiments, is a client computer and includes a data access application or utility 118. The client 116 is communicatively coupled to the database management system 104 and may submit queries to and receive query results from the database management system 104.
The database management system 104, in typical embodiments, is a relational database management system. The database management system includes a file system 106 and a memory 110.
The memory 110 is a memory that is resident on a computing device on which the database management system 104 operates. The memory 110 is illustrated within the database management system 104 merely as a logical representation.
The memory 110 includes a cache 112 that holds transient data under control of the file system 106. Data written from the database management system 104 to storage device 102 is first written to the cache 112 and eventually is flushed from the cache 112 after the file system 106 writes the data to the storage device 102. Also, data retrieved from the storage device 102 in response to a query, or other operation, is retrieved by the file system 106 to the cache 112.
The file system 106 of the database management system 104 typically manages data blocks that contain one or more rows of a single table. In some such embodiments, the file system 106 maintains the cache 112 in the memory 110. The cache 112 holds data blocks that have been loaded from disk or written by the database management system 104. Requests, or queries, for rows within a table are satisfied from data blocks maintained within the cache 112. Thus, when a query is made against the database, the database management system 104 identifies the relevant rows from the relevant tables, and requests those rows from the file system 106. The file system 106 reads the data blocks containing those rows into the cache 112. The database management system 104 then performs the query against the row or rows in the cache 112.
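The read and write paths through the cache described above can be sketched as follows. This is a minimal illustration only: the class name `BlockCache`, its methods, and the use of a dictionary to stand in for the storage device are all hypothetical and are not taken from the described system.

```python
from typing import Dict, Set


class BlockCache:
    """Toy model of a file system that serves data blocks through a cache.

    All names here are hypothetical; `storage` merely stands in for a
    storage device and `cache` for an in-memory block cache.
    """

    def __init__(self, storage: Dict[int, bytes]) -> None:
        self.storage = storage
        self.cache: Dict[int, bytes] = {}
        self.dirty: Set[int] = set()   # blocks written but not yet on storage
        self.disk_reads = 0

    def read_block(self, block_id: int) -> bytes:
        # A block is loaded from storage only if it is not already cached.
        if block_id not in self.cache:
            self.cache[block_id] = self.storage[block_id]
            self.disk_reads += 1
        return self.cache[block_id]

    def write_block(self, block_id: int, data: bytes) -> None:
        # Writes land in the cache first.
        self.cache[block_id] = data
        self.dirty.add(block_id)

    def flush(self) -> None:
        # Write dirty blocks to storage, then drop them from the cache.
        for block_id in sorted(self.dirty):
            self.storage[block_id] = self.cache.pop(block_id)
        self.dirty.clear()
```

Repeated reads of the same block touch the storage stand-in only once, and a written block reaches storage only when `flush()` is called.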
In some embodiments, data stored in the database may be compressed. In some such embodiments, the data stored in the database is compressed by a data compression service 108 of the database management system 104 before it is presented to the file system, and decompressed as required by the execution of a SQL request.
Referring now to
Each of encoding tables 211 through 215 contains fewer characters than the full alphabet of the uncompressed string. Using these separate, smaller encoding tables for characters that occur frequently across many data sets allows the small character subsets to be encoded efficiently, at the cost of some overhead for switching between encoding tables.
In the embodiment described herein, the five encoding tables include a numeric alphabet encoding table 211, shown in
Numeric alphabet encoding table 211, shown in
Uppercase letter alphabet encoding table 212 and lowercase letter alphabet encoding table 213, shown in
The processes for compressing character data and decompressing compressed data using the character encoding tables 211 through 215 are shown by the flowcharts illustrated in
Referring to
In general, compression requires a careful analysis of the uncompressed character data string 201 in order to divide it into substrings. Each substring must be encodable by a single table. One way to divide the string into substrings is a greedy algorithm. The greedy algorithm scans through the source string. When presented with a character, it determines the smallest encoding table that includes that character. If that encoding table is currently being used, the character is simply encoded using that table. Otherwise, the current substring is terminated, and a new substring is started using this smallest encoding table.
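The greedy division into substrings can be sketched as follows. This is an illustration under stated assumptions, not the encoding of the described embodiment: the table names and contents in `TABLES` are hypothetical stand-ins (the actual contents of tables 211 through 215 are not reproduced here), and the sketch only splits the string; it does not emit table IDs, bit codes, or stop indicators.

```python
import string
from typing import List, Tuple

# Hypothetical encoding tables (name, characters). These contents are
# assumptions made for illustration only.
TABLES: List[Tuple[str, str]] = [
    ("numeric", "0123456789-"),
    ("upper", string.ascii_uppercase + " "),
    ("lower", string.ascii_lowercase),
    ("general", string.printable),  # fallback table covering printable ASCII
]


def smallest_table(ch: str) -> str:
    """Return the name of the smallest table that contains `ch`."""
    candidates = [(len(chars), name) for name, chars in TABLES if ch in chars]
    if not candidates:
        raise ValueError(f"no table can encode {ch!r}")
    return min(candidates)[1]


def greedy_segments(text: str) -> List[Tuple[str, str]]:
    """Split `text` into (table name, substring) runs using the greedy rule."""
    segments: List[Tuple[str, str]] = []
    for ch in text:
        table = smallest_table(ch)
        if segments and segments[-1][0] == table:
            # Same table as the current substring: extend it.
            segments[-1] = (table, segments[-1][1] + ch)
        else:
            # Different table: terminate the current substring, start a new one.
            segments.append((table, ch))
    return segments
```

With these assumed tables, `greedy_segments("CA 92127-1046")` yields two runs, one in the hypothetical `upper` table and one in the `numeric` table. A greedy split is simple and fast, but it may switch tables more often than an optimal split would, since it never looks ahead.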
Data compressed using the encoding tables is composed of a sequence of segments, one for each contiguous run of characters that belong to the same encoding table. Each segment consists of an encoding table ID, the encoded characters, and the table stop indicator, as shown below:
where:
Employing the compression scheme described herein, the compressed value for the character string “CA 92127-1046” is:
The total length of the compressed value for the character string “CA 92127-1046” is 72 bits, or 9 bytes. This compares with 13 bytes for a Latin or Unicode UTF-8 encoding scheme, or 26 bytes for a Unicode UTF-16 encoding scheme.
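The uncompressed byte counts quoted above can be checked directly. The snippet below is only an arithmetic check; it does not implement the compression scheme, and the 72-bit compressed length is taken from the text rather than computed.

```python
text = "CA 92127-1046"

latin1_bytes = len(text.encode("latin-1"))   # one byte per character
utf8_bytes = len(text.encode("utf-8"))       # ASCII characters take one byte each
utf16_bytes = len(text.encode("utf-16-le"))  # two bytes each, no byte-order mark

compressed_bits = 72                         # figure stated in the description
compressed_bytes = compressed_bits // 8

print(latin1_bytes, utf8_bytes, utf16_bytes, compressed_bytes)
```

On these figures the compressed value occupies 9 of the 13 bytes needed by a one-byte-per-character encoding, and 9 of the 26 bytes needed by UTF-16.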
Following encoding, in step 604 the compressed data is stored in data storage, or provided for transmission.
The process for decompressing compressed data is shown in the flowchart of
Instructions of the various software routines discussed herein, such as the methods illustrated in
The foregoing description of the invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed.
The number, sizes and contents of the encoding tables may vary. For a specific data set, an appropriate set of encoding tables can be chosen manually or generated automatically based on the characteristics of the data, in order to achieve a desirable compression rate. Meanwhile, according to the above observation 3, some algorithms can be created with general encoding tables, which will likely achieve a respectable compression rate over common character data. Hence what is proposed here is a family of compression algorithms, not just a single algorithm.
The sample algorithm described above can readily be adapted to better compress data that frequently switches between uppercase and lowercase letters (such as names and addresses) by combining the Lowercase Table and Uppercase Table into a single 6-bit Mixed-case Table.
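The 6-bit width of a combined table follows from a simple count, sketched below. The helper `bits_per_code` is hypothetical, and reserving one code inside each table for a stop indicator is an assumption made for illustration; the description does not state here how the stop indicator is represented.

```python
def bits_per_code(symbols: int, reserved: int = 1) -> int:
    """Fixed code width, in bits, needed for `symbols` characters plus
    `reserved` extra codes (for example an assumed in-table stop code)."""
    total = symbols + reserved
    if total < 1:
        raise ValueError("at least one code is required")
    # (total - 1).bit_length() is the smallest width that can hold `total` codes.
    return max(1, (total - 1).bit_length())


single_case_bits = bits_per_code(26)   # 26 letters plus one reserved code
mixed_case_bits = bits_per_code(52)    # 52 letters plus one reserved code

print(single_case_bits, mixed_case_bits)
```

Under this assumption a combined table spends one more bit per letter than a single-case table, so it pays off only when case changes are frequent enough that the saved table switches outweigh the extra bit per character.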
The advantages of the compression techniques described herein include:
Additional alternatives, modifications, and variations will be apparent to those skilled in the art in light of the above teaching. Accordingly, this invention is intended to embrace all alternatives, modifications, equivalents, and variations that fall within the spirit and broad scope of the attached claims.
This application claims priority under 35 U.S.C. §119(e) to the following co-pending and commonly-assigned provisional patent application, which is incorporated herein by reference: Provisional Patent Application Ser. No. 61/580,928, entitled “SYSTEM AND METHOD FOR DATA COMPRESSION USING MULTIPLE ENCODING TABLES” by Gary Roberts, Guilian Wang, and Fred Kaufmann; filed on Dec. 28, 2011.
Number | Date | Country
---|---|---
61580928 | Dec 2011 | US