Method of generating Huffman code length information

Information

  • Patent Grant
  • 6987469
  • Patent Number
    6,987,469
  • Date Filed
    Tuesday, June 3, 2003
  • Date Issued
    Tuesday, January 17, 2006
Abstract
Embodiments of a method of generating Huffman code length information are disclosed. In one such embodiment, a data structure is employed, although, of course, the invention is not limited in scope to the particular embodiments disclosed.
Description
BACKGROUND

The present disclosure is related to Huffman coding.


As is well-known, Huffman codes for a set of symbols are generated based, at least in part, on the probability of occurrence of the source symbols. A binary tree, commonly referred to as a “Huffman Tree,” is generated to extract the binary code and the code length. See, for example, D. A. Huffman, “A Method for the Construction of Minimum-Redundancy Codes,” Proceedings of the IRE, Volume 40, No. 9, pages 1098 to 1101, 1952. D. A. Huffman, in the aforementioned paper, describes the process this way:

  • List all possible symbols with their probabilities;
  • Find the two symbols with the smallest probabilities;
  • Replace these by a single set containing both symbols, whose probability is the sum of the individual probabilities;
  • Repeat until the list contains only one member.


This procedure produces a recursively structured set of sets, each of which contains exactly two members. It may, therefore, be represented as a binary tree (“Huffman Tree”) with the symbols as the “leaves.” Then, to form the code (“Huffman Code”) for any particular symbol, traverse the binary tree from the root to that symbol, recording “0” for a left branch and “1” for a right branch. One issue with this procedure, however, is that the resultant Huffman tree is not unique.

One example of an application of such codes is text compression, such as GZIP. GZIP is a text compression utility, developed under the GNU (GNU's Not Unix) project, a project with a goal of developing a “free” or freely available UNIX-like operating system, for replacing the “compress” text compression utility on a UNIX operating system. See, for example, Gailly, J. L. and Adler, M., GZIP documentation and sources, available as gzip-1.2.4.tar at the website “http://www.gzip.org”.

In GZIP, Huffman tree information is passed from the encoder to the decoder in terms of a set of code lengths along with compressed text. Both the encoder and decoder, therefore, generate a unique Huffman code based upon this code-length information. However, generating length information for the Huffman codes by constructing the corresponding Huffman tree is inefficient. In particular, the resulting Huffman codes from the Huffman tree are typically abandoned because the encoder and the decoder will generate the same Huffman codes from the code length information. It would, therefore, be desirable if another approach for generating the code length information were available.
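For comparison, the conventional tree-based derivation of code lengths described in this background may be sketched as follows. This is only an illustrative sketch: the function name `huffman_lengths_via_tree`, the dictionary input format, and the use of Python's `heapq` module are assumptions made for the example and do not appear in the disclosure.

```python
import heapq
from itertools import count


def huffman_lengths_via_tree(freqs):
    """Return {symbol: code length} by building an explicit Huffman tree.

    freqs maps each symbol to a positive frequency (hypothetical format).
    Internal tree nodes are two-element lists; symbols are dict keys and
    therefore can never be lists, so the two cannot be confused.
    """
    if len(freqs) == 1:
        # A lone symbol has no branch to follow; report length 0 here.
        return {sym: 0 for sym in freqs}

    tie = count()  # tie-breaker so the heap never compares tree nodes
    heap = [(f, next(tie), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)

    # Repeatedly replace the two least frequent members by their union.
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), [left, right]))

    # The code length of a symbol is the depth of its leaf.
    lengths = {}
    stack = [(heap[0][2], 0)]
    while stack:
        node, depth = stack.pop()
        if isinstance(node, list):
            stack.append((node[0], depth + 1))
            stack.append((node[1], depth + 1))
        else:
            lengths[node] = depth
    return lengths
```

Note that the tree is built in full and then walked only to read off leaf depths, which is the redundancy the background points to.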





BRIEF DESCRIPTION OF THE DRAWINGS

The subject matter regarded as the invention is particularly pointed out and distinctly claimed in the concluding portion of this specification. The invention, however, both as to organization and method of operation, together with objects, features, and advantages thereof, may best be understood by reference to the following detailed description when read with the accompanying drawings in which:



FIG. 1 is a table illustrating a set of symbols with their corresponding frequency to which an embodiment in accordance with the present invention may be applied;



FIG. 2 is a table illustrating a first portion of an embodiment in accordance with the present invention, after initialization for the data shown in FIG. 1;



FIG. 3 is a table illustrating a second portion of an embodiment of the present invention, after initialization for the data shown on FIG. 2;



FIG. 4 is the table of FIG. 2, after a first merging operation has been applied;



FIG. 5 is the table of FIG. 3, after a first merging operation has been applied;



FIG. 6 is the table of FIG. 5, after the merging operations have been completed; and



FIG. 7 is the table of FIG. 4, after the merging operations have been completed.





DETAILED DESCRIPTION

In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, it will be understood by those skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, components and circuits have not been described in detail so as not to obscure the present invention.


As previously described, Huffman codes for a set of symbols are generated based, at least in part, on the probability of occurrence of the source symbols. Accordingly, a binary tree, commonly referred to as a Huffman tree, is generated to extract the binary code and the code length. For example, in one application for text compression standards, such as GZIP, although, of course, the invention is not limited in scope to this particular application, the Huffman tree information is passed from encoder to decoder in terms of a set of code lengths with the compressed text data. Both the encoder and decoder generate a unique Huffman code based on the code length information. However, generating the length information for the Huffman codes by constructing the corresponding Huffman tree is inefficient and often redundant. After the Huffman codes are produced from the Huffman tree, the codes are abandoned because the encoder and decoder will generate the Huffman codes based on the length information. Therefore, it would be desirable if the length information could be determined without producing a Huffman tree.
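To illustrate why code lengths alone are sufficient for both encoder and decoder, the following sketch assigns codes from a list of lengths using the canonical procedure given in the DEFLATE specification (RFC 1951, section 3.2.2), the format used by GZIP. The function name `canonical_codes` and the list-of-lengths input are assumptions for illustration and are not part of this disclosure.

```python
def canonical_codes(lengths):
    """Map symbol index -> code string, given lengths[i] for symbol i.

    A length of 0 marks an unused symbol. Follows the counting scheme of
    RFC 1951, section 3.2.2.
    """
    max_len = max(lengths, default=0)

    # Count how many codes exist for each code length.
    bl_count = [0] * (max_len + 1)
    for n in lengths:
        if n:
            bl_count[n] += 1

    # Compute the numerically smallest code for each code length.
    next_code = [0] * (max_len + 1)
    code = 0
    for bits in range(1, max_len + 1):
        code = (code + bl_count[bits - 1]) << 1
        next_code[bits] = code

    # Hand out consecutive codes within each length, in symbol order.
    codes = {}
    for sym, n in enumerate(lengths):
        if n:
            codes[sym] = format(next_code[n], "0{}b".format(n))
            next_code[n] += 1
    return codes
```

Because the assignment depends only on the lengths and the symbol order, an encoder and a decoder running this procedure on the same lengths arrive at the same codes.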


One embodiment, in accordance with the invention, of a method of generating code lengths for symbols to be coded, using a data structure, is provided. In this particular embodiment, the data structure is sorted, symbols in the data structure are combined, and symbol length is updated based, at least in part, on the frequency of the symbols being coded. In this particular embodiment, the data structure aids in the extraction of lengths of Huffman codes from a group of symbols without generating a Huffman tree, where the probability of occurrence of the symbols is known. Although the invention is not limited in scope to this particular embodiment, experimental results show efficiency both in terms of computation and usage of memory, suitable for both software and hardware implementation.



FIG. 1 is a table illustrating a set of symbols with their corresponding frequency, although, of course, this is provided simply as an example. An embodiment of a method of generating code lengths in accordance with the present invention may be applied to this set of symbols. FIG. 1 illustrates a set of 18 symbols, although, of course, the invention is not limited in scope in this respect. In this particular example, although, again, the invention is not limited in scope in this respect, inspection of the frequency information reveals that two symbols, index nos. 7 and 13 (the shaded regions in FIG. 1), do not occur in this symbol set. Therefore, these symbols need not be considered for Huffman coding. In this particular embodiment, symbols having a zero frequency are omitted, although the invention is not restricted in scope in this respect.


In this particular embodiment, although, again, the invention is not limited in scope in this respect, the data structure to be employed has at least two portions. As has previously been indicated, it is noted that the invention is not restricted in scope to this particular data structure. Clearly, many modifications to this particular data structure may be made and still remain within the spirit and scope of what has been described. For this embodiment, however, one portion is illustrated in FIG. 2. This portion of the data structure tracks or stores the index and length information for each non-zero frequency symbol. As illustrated in FIG. 2, this portion is initialized with zero length in descending order in terms of frequency and symbol index. Of course, other embodiments are applicable, such as using ascending order, for example. FIG. 2 illustrates this first portion of an embodiment applied to the symbols of FIG. 1.


As illustrated, FIG. 2 includes 16 entries, zero to 15, corresponding to the 16 non-zero frequency symbols. In this particular data structure, although the invention is not limited in scope in this respect, the first field or column shows the associated symbol indices after the previously described sorting operation. The symbol frequency information illustrated in FIG. 2 is not part of the data structure, but is provided here merely for illustration purposes. It illustrates the descending order of the symbols in terms of frequency, in this example. The second field or column of the data structure, although, again, the invention is not limited in scope in this respect or to this particular embodiment, contains the length information for each symbol and is initialized to zero.
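The initialization of this first portion may be sketched as follows. The frequencies used in the test are hypothetical and are not the values of FIG. 1; the function name `init_part_one`, the use of two parallel lists, and the choice to break frequency ties by ascending symbol index are all assumptions made for illustration.

```python
def init_part_one(freqs):
    """Build the first portion: sorted symbol indices and zeroed lengths.

    freqs[i] is the frequency of symbol index i. Zero-frequency symbols
    are omitted. Entries are ordered by descending frequency; ties are
    broken by ascending symbol index (an assumption of this sketch).
    """
    sym_index = sorted(
        (i for i, f in enumerate(freqs) if f > 0),
        key=lambda i: (-freqs[i], i),
    )
    length = [0] * len(sym_index)  # every code length starts at zero
    return sym_index, length
```

As in FIG. 2, the frequencies themselves are not stored in this portion; they are used only to establish the order of the entries.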


The second part or portion of the data structure for this particular embodiment, after initialization using the data or symbols in FIG. 2, is shown or illustrated in FIG. 3. In this particular embodiment, the first field of this portion of the data structure, that is the portion illustrated in FIG. 3, contains the frequency for the group. The second field for this particular embodiment contains bit flags. The bit flags correspond to or indicate the entry number of the symbols belonging to the group. For example, as illustrated in FIG. 3, the shaded area contains a symbol with entry no. 3. For this particular symbol, the group frequency is 3 and the bit flags are set to:

    • bit number: (15 . . . 3210)
    • bit value: 0000 0000 0000 1000


      that is, bit number 3 is set to “1” in this example, while the remaining bits are set to “0”.
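The initialization of this second portion may be sketched as follows, with one group per entry and a single bit set per group. The function name `init_part_two` and the sample frequencies are hypothetical; only the entry-number-to-bit mapping is taken from the description above.

```python
def init_part_two(sorted_freqs):
    """Build the second portion: [group frequency, bit flags] per entry.

    sorted_freqs lists the non-zero frequencies in the same order as the
    entries of the first portion. Entry number e gets bit number e set.
    """
    return [[f, 1 << e] for e, f in enumerate(sorted_freqs)]


groups = init_part_two([9, 7, 4, 3])
# Entry no. 3 has only bit number 3 set, shown here as a 16-bit field.
entry3_bits = format(groups[3][1], "016b")
```

Python integers are unbounded, so this sketch is not tied to a 16-bit flag field; a fixed-width implementation would size the field to the number of non-zero frequency symbols.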


As previously described, each symbol to be coded is initially assigned a different bit flag. Again, in this particular embodiment, although the invention is, again, not limited in scope in this respect, the code length initially comprises zero for each symbol. As shall be described in more detail hereinafter, in this particular embodiment, with the data structure initialized, symbol flags are combined beginning with the smallest frequency symbols. The symbols are then resorted and frequency information is updated to reflect the combination. These operations of combining symbol flags and resorting are then repeated until no more symbols remain to be combined.


As previously described, the process is begun by initializing the data structure, such as the embodiment previously described, and setting a “counter” designated here “no_of_group”, to the number of non-zero frequency symbols, here 16. Next, while this “counter,” that is, no_of_group, is greater than one, the following operations are performed.


Begin

    • 1: Initialize the data structure (both parts I and II) as described above, and set no_of_group to the number of non-zero frequency symbols.

    • 2: while (no_of_group > 1) {

      • 2.1: Merge the last two groups in the data structure of part II, and insert the merged group back into the list. /* The merge operation for the group frequency is simply to add the two frequencies together, and the merge operation for the second field is simply a bit-wise “OR” operation. Both are very easy to implement in terms of software and hardware. FIG. 5 shows an example of this step. As can be seen, the last two groups are merged and inserted back into the list (shown in the shaded area). Since two groups are always merged into one, the memory can be reused and no new memory needs to be dynamically allocated after initialization. */

      • 2.2: Update the length information in the data structure of part I. /* This step is done by scanning the “1” bits in the merged bit flags (second field in the data structure of part II) and increasing the length information by one in the corresponding entries in the data structure of part I. FIG. 4 shows the updates after the merge step shown in FIG. 5. */

      • 2.3: Reduce no_of_group by one.

    } /* end of while */

End
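The steps above may be sketched end to end as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the function name `huffman_code_lengths` is hypothetical, ties in the initial sort are broken by ascending symbol index, and a merged group is inserted after any existing groups of equal frequency. The sketch also uses ordinary Python lists, so it does not attempt the fixed-memory reuse noted in step 2.1.

```python
def huffman_code_lengths(freqs):
    """Return code lengths per symbol index without building a tree.

    freqs[i] is the frequency of symbol i; unused symbols get length 0.
    """
    # Step 1, part I: entries sorted by descending frequency, lengths zero.
    order = sorted(
        (i for i, f in enumerate(freqs) if f > 0),
        key=lambda i: (-freqs[i], i),
    )
    length = [0] * len(order)

    # Step 1, part II: one group per entry, flag bit e for entry e.
    groups = [[freqs[i], 1 << e] for e, i in enumerate(order)]
    no_of_group = len(groups)

    while no_of_group > 1:
        # Step 2.1: merge the last two groups (add frequencies, OR flags).
        f_a, flags_a = groups.pop()
        f_b, flags_b = groups.pop()
        merged_freq = f_a + f_b
        merged_flags = flags_a | flags_b

        # Re-insert so that the list stays in descending frequency order.
        pos = len(groups)
        while pos > 0 and groups[pos - 1][0] < merged_freq:
            pos -= 1
        groups.insert(pos, [merged_freq, merged_flags])

        # Step 2.2: add one to the length of every entry in the merged group.
        for e in range(len(order)):
            if (merged_flags >> e) & 1:
                length[e] += 1

        # Step 2.3
        no_of_group -= 1

    # Map entry numbers back to the original symbol indices.
    result = [0] * len(freqs)
    for e, i in enumerate(order):
        result[i] = length[e]
    return result
```

For example, with the hypothetical frequencies `[5, 0, 2, 1, 1]`, the zero-frequency symbol is skipped and the remaining four symbols go through three merges.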


As illustrated in FIG. 5, for example, the last two “groups” or “rows” in the second part or portion of the data structure are combined or merged and, as illustrated in FIG. 5, this portion of the data structure is resorted, that is, the combined symbols are sorted in the data structure appropriately based upon group frequency, in this particular embodiment.


It is likewise noted, although the invention is not limited in scope in this respect, that the merger or combining operation for the group frequency may be implemented in this particular embodiment by simply adding the frequencies together and a merger/combining operation for the second field of the data structure for this particular embodiment may be implemented as a “bitwise” logical OR operation. This provides advantages in terms of implementation in software and/or hardware. Another advantage of this particular embodiment is efficient use of memory, in addition to the ease of implementation of operations, such as summing and logical OR operations.


As previously described, a combining or merge operation results in two “groups” or “rows” being combined into one. Therefore, memory that has been allocated may be reused and the dynamic allocation of new memory after initialization is either reduced or avoided.


Next, the length information in the first portion or part of the data structure for this particular embodiment is updated to reflect the previous merging or combining operation. This is illustrated, for example, for this particular embodiment, in FIG. 4. One way to implement this operation, although the invention is not restricted in scope in this respect, is by scanning the “one” bits of the merged bit flags. That is, in this particular embodiment, the second field in the second portion of the data structure, is scanned and length information is increased or augmented by one in the corresponding entries in the first portion or part of the data structure.
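One possible way to scan the “one” bits of the merged bit flags is sketched below. The helper name `add_one_to_flagged` is hypothetical, and isolating the lowest set bit with `flags & -flags` is simply one common technique, not something specified in this description; a straightforward loop over all bit positions would serve equally well.

```python
def add_one_to_flagged(length, merged_flags):
    """Increment length[e] in place for every bit e set in merged_flags."""
    while merged_flags:
        lowest = merged_flags & -merged_flags  # isolate the lowest set bit
        entry = lowest.bit_length() - 1        # bit number equals entry number
        length[entry] += 1
        merged_flags ^= lowest                 # clear that bit and continue
    return length
```

With this approach the loop runs once per symbol in the merged group rather than once per entry in the data structure.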


Next, the “counter,” that is, here, no_of_group, is reduced by one. The previous operations are repeated until the counter reaches the value one in this particular embodiment.


It should be noted that, for this particular embodiment, once the “counter” reaches one, as illustrated in FIG. 6, there should be one group or row in the second portion of the data structure, with a group frequency equal to the total group frequency, that is, the sum of all symbol frequencies, and with all bits in the bit flags set to one. Likewise, FIG. 7 shows the final code length information once this has occurred. Therefore, as illustrated in FIG. 7, the desired code length information is obtained.
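These terminal conditions lend themselves to a simple sanity check, sketched below. The function name `check_final_state` and its parameters are hypothetical. The last test, that the code lengths satisfy the Kraft equality (the sum of 2^(-length) equals one for two or more symbols), is an additional check added for this sketch and is not stated in the description.

```python
from fractions import Fraction


def check_final_state(symbol_freqs, final_freq, final_flags, lengths):
    """Check the single remaining group and the resulting code lengths.

    symbol_freqs: the non-zero symbol frequencies (two or more symbols).
    final_freq, final_flags: fields of the one remaining group.
    lengths: the code lengths produced for those symbols.
    """
    n = len(symbol_freqs)
    all_bits_set = final_flags == (1 << n) - 1
    freq_is_total = final_freq == sum(symbol_freqs)
    # Extra check (assumption of this sketch): Kraft sum is exactly one.
    kraft_sum = sum(Fraction(1, 2 ** k) for k in lengths)
    return all_bits_set and freq_is_total and kraft_sum == 1
```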


As previously described, for this particular embodiment of a method of generating code length information, several advantages exist. As previously discussed, in comparison, for example, with generating the Huffman tree, memory usage is reduced and the dynamic allocation of memory may be avoided or the amount of memory to be dynamically allocated is reduced. Likewise, computational complexity is reduced.


Likewise, as previously described, operations employed to implement the previously described embodiment are relatively easy to implement in hardware or software, although the invention is not limited in scope to these embodiments or to these particular operations. Thus, Huffman code length information may be extracted or produced without generating a Huffman tree.


In an alternative embodiment in accordance with the present invention, a method of encoding symbols may comprise encoding symbols using code length information; and generating the code length information without using a Huffman tree, such as, for example, using the embodiment previously described for generating code length information, although the invention is, of course, not limited in scope to the previous embodiment. It is, of course, understood in this context, that the length information is employed to encode symbols where the length information is generated from a Huffman code. Likewise, in another alternative embodiment in accordance with the present invention, a method of decoding symbols may comprise decoding symbols, wherein the symbols have been encoded using code length information and the code length information was generated without using a Huffman tree. It is, again, understood in this context, that the length information employed to encode symbols is generated from a Huffman code. Again, one approach to generate the code length information comprises the previously described embodiment.


It will, of course, be understood that, although particular embodiments have just been described, the invention is not limited in scope to a particular embodiment or implementation. For example, one embodiment may be in hardware, whereas another embodiment may be in software. Likewise, an embodiment may be in firmware, or any combination of hardware, software, or firmware, for example. Likewise, although the invention is not limited in scope in this respect, one embodiment may comprise an article, such as a storage medium. Such a storage medium, such as, for example, a CD-ROM, or a disk, may have stored thereon instructions, which when executed by a system, such as a computer system or platform, or an imaging system, may result in an embodiment of a method in accordance with the present invention being executed, such as a method of generating Huffman code length information, for example, as previously described. Likewise, embodiments of a method of initializing a data structure, encoding symbols, and/or decoding symbols, in accordance with the present invention, may be executed.


While certain features of the invention have been illustrated and described herein, many modifications, substitutions, changes and equivalents will now occur to those skilled in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the invention.

Claims
  • 1. A method of generating, for symbols to be coded, code lengths, using a data structure, said method comprising: sorting the data structure, combining symbols in the data structure, and updating symbol length, based, at least in part, on the frequency of the symbols being coded, each symbol to be coded being initially assigned a flag and the same length.
  • 2. The method of claim 1, wherein initially each symbol to be coded is assigned a different flag.
  • 3. The method of claim 2, wherein the same length initially comprises zero.
  • 4. The method of claim 2, wherein the data structure comprises at least two portions; a first portion comprising symbol index and associated symbol length information and a second portion comprising group frequency and assigned bit flag information.
  • 5. The method of claim 4, wherein the symbols are sorted in the data structure based on frequency in descending order.
  • 6. The method of claim 5, wherein symbols are combined in the data structure beginning with the smallest frequency symbols.
  • 7. The method of claim 6, wherein, after the symbol length information is updated to reflect the combined symbols in the data structure, the symbols are resorted based on frequency in descending order.
  • 8. The method of claim 4, wherein the symbols are sorted in the data structure based on frequency in ascending order.
  • 9. The method of claim 8, wherein symbols are combined in the data structure beginning with the smallest frequency symbols.
  • 10. The method of claim 9, wherein, after the symbol length information is updated to reflect the combined symbols in the data structure, the symbols are resorted based on frequency in ascending order.
  • 11. The method of claim 1, wherein symbols having a zero frequency are omitted.
  • 12. A method of generating code lengths for a grouping of symbols to be coded in accordance with a Huffman code, comprising: (a) sorting the symbols by frequency and assigning a flag and the same initial length to each symbol; (b) combining symbol flags beginning with the smallest frequency symbols; (c) resorting the symbols and updating length information to reflect the combination; and repeating (b) and (c) until no more symbols remain to be combined.
  • 13. The method of claim 12, wherein sorting the symbols by frequency includes omitting the symbols having a zero frequency.
  • 14. The method of claim 12, wherein the same initial length comprises zero.
  • 15. A data structure comprising: at least two portions; a first portion comprising symbol indices, wherein said symbol indices are sorted by frequency; and a second portion comprising group frequency information and an assigned flag corresponding to each respective symbol.
  • 16. The data structure of claim 15, wherein the symbols are sorted in the data structure in descending order by frequency.
  • 17. The data structure of claim 15, wherein the symbols are sorted in the data structure in ascending order by frequency.
  • 18. An article comprising: a storage medium, said storage medium having stored thereon, instructions that, when executed, result in the following: generating, using a data structure, code lengths for symbols to be coded, and initially assigning each symbol to be coded a flag, the generating comprising: sorting the data structure, combining symbols in the data structure, and updating symbol length, based, at least in part, on the frequency of the symbols being coded.
  • 19. The article of claim 18, wherein said instructions, when executed, result in the data structure comprising at least two portions; a first portion comprising symbol index and associated symbol length information and a second portion comprising group frequency and assigned bit flag information.
  • 20. An article comprising: a storage medium, said storage medium having stored thereon, instructions that, when executed, result in the following: initializing a data structure usable in generating code lengths for symbols to be coded, the initializing comprising: sorting the symbols by frequency and assigning a flag and the same initial length to each symbol.
  • 21. The article of claim 20 wherein said instructions, when executed, further result in each symbol being assigned an initial length of zero.
  • 22. The article of claim 20, wherein said instructions, when executed, further result in, the data structure including group frequency information for each symbol.
  • 23. A method of encoding symbols comprising: encoding symbols using code length information; generating, using a data structure, the code length information without using a Huffman tree, the data structure including group frequency information for each symbol.
  • 24. The method of claim 23, wherein said data structure includes symbol indices and an initially assigned flag and code length.
  • 25. A method of decoding symbols comprising: decoding symbols, wherein the symbols have been encoded using code length information and the code length information was generated using a data structure, and without using a Huffman tree, the data structure including symbol indices.
  • 26. The method of claim 25, wherein the data structure comprises group frequency information for each symbol and an initially assigned flag and code length.
RELATED APPLICATIONS

This patent application is a continuation of U.S. patent application Ser. No. 09/704,392, filed Oct. 31, 2000, now U.S. Pat. No. 6,636,167, titled “A Method of Generating Huffman Code Length Information.” The subject patent application is also related to U.S. patent application Ser. No. 09/704,380, filed Oct. 31, 2000, titled “A Method of Performing Huffman Decoding,” by Acharya et al., assigned to the assignee of the present invention and herein incorporated by reference. The subject patent application is also related to U.S. patent application Ser. No. 10/293,187, titled “A Method of Performing Huffman Decoding,” by Acharya et al., assigned to the assignee of the present invention. The subject patent application is also related to U.S. patent application Ser. No. 10/391,892, titled “A Method of Performing Huffman Decoding,” by Acharya et al., assigned to the assignee of the present invention.

US Referenced Citations (121)
Number Name Date Kind
4813056 Fedele Mar 1989 A
5467088 Kinouchi et al. Nov 1995 A
5778371 Fujihara Jul 1998 A
5875122 Acharya Feb 1999 A
5973627 Bakhmutsky Oct 1999 A
5995210 Acharya Nov 1999 A
6009201 Acharya Dec 1999 A
6009206 Acharya Dec 1999 A
6047303 Acharya Apr 2000 A
6075470 Little et al. Jun 2000 A
6091851 Acharya Jul 2000 A
6094508 Acharya et al. Jul 2000 A
6108453 Acharya Aug 2000 A
6124811 Acharya et al. Sep 2000 A
6130960 Acharya Oct 2000 A
6151069 Dunton et al. Nov 2000 A
6151415 Acharya et al. Nov 2000 A
6154493 Acharya et al. Nov 2000 A
6166664 Acharya Dec 2000 A
6178269 Acharya Jan 2001 B1
6195026 Acharya Feb 2001 B1
6215908 Pazmino et al. Apr 2001 B1
6215916 Acharya Apr 2001 B1
6229578 Acharya et al. May 2001 B1
6233358 Acharya May 2001 B1
6236433 Acharya et al. May 2001 B1
6236765 Acharya May 2001 B1
6269181 Acharya Jul 2001 B1
6275206 Tsai et al. Aug 2001 B1
6285796 Acharya et al. Sep 2001 B1
6292114 Tsai et al. Sep 2001 B1
6292144 Taflove et al. Sep 2001 B1
6301392 Acharya Oct 2001 B1
6348929 Acharya Feb 2002 B1
6351555 Acharya et al. Feb 2002 B1
6356276 Acharya Mar 2002 B1
6366692 Acharya Apr 2002 B1
6366694 Acharya Apr 2002 B1
6373481 Tan et al. Apr 2002 B1
6377280 Acharya et al. Apr 2002 B1
6381357 Tan et al. Apr 2002 B1
6392699 Acharya May 2002 B1
6449380 Acharya et al. Sep 2002 B1
6535648 Acharya Mar 2003 B1
6556242 Dunton et al. Apr 2003 B1
6563439 Acharya et al. May 2003 B1
6563948 Tan et al. May 2003 B2
6574374 Acharya Jun 2003 B1
6600833 Tan et al. Jul 2003 B1
6608912 Acharya et al. Aug 2003 B2
6625308 Acharya et al. Sep 2003 B1
6625318 Tan et al. Sep 2003 B1
6628716 Tan et al. Sep 2003 B1
6628827 Acharya Sep 2003 B1
6633610 Acharya Oct 2003 B2
6636167 Acharya et al. Oct 2003 B1
6639691 Acharya Oct 2003 B2
6640017 Tsai et al. Oct 2003 B1
6646577 Acharya et al. Nov 2003 B2
6650688 Acharya et al. Nov 2003 B1
6653953 Becker et al. Nov 2003 B2
6654501 Acharya et al. Nov 2003 B1
6658399 Acharya et al. Dec 2003 B1
6662200 Acharya Dec 2003 B2
6678708 Acharya Jan 2004 B1
6681060 Acharya et al. Jan 2004 B2
6690306 Acharya et al. Feb 2004 B1
6694061 Acharya Feb 2004 B1
6697534 Tan et al. Feb 2004 B1
6707928 Acharya et al. Mar 2004 B2
6725247 Acharya Apr 2004 B2
6731706 Acharya et al. May 2004 B1
6731807 Pazmino et al. May 2004 B1
6738520 Acharya et al. May 2004 B1
6748118 Acharya et al. Jun 2004 B1
6751640 Acharya Jun 2004 B1
6757430 Metz et al. Jun 2004 B2
6759646 Acharya et al. Jul 2004 B1
6766286 Acharya Jul 2004 B2
6775413 Acharya Aug 2004 B1
6795566 Acharya et al. Sep 2004 B2
6795592 Acharya et al. Sep 2004 B2
6798901 Acharya et al. Sep 2004 B1
6813384 Acharya et al. Nov 2004 B1
6825470 Bawolek et al. Nov 2004 B1
6834123 Acharya et al. Dec 2004 B2
20020063789 Acharya et al. May 2002 A1
20020063899 Acharya et al. May 2002 A1
20020101524 Acharya Aug 2002 A1
20020118746 Kim et al. Aug 2002 A1
20020122482 Hyun et al. Sep 2002 A1
20020161807 Acharya Oct 2002 A1
20020174154 Acharya Nov 2002 A1
20020181593 Acharya et al. Dec 2002 A1
20030021486 Acharya Jan 2003 A1
20030053666 Acharya et al. Mar 2003 A1
20030063782 Acharya et al. Apr 2003 A1
20030067988 Kim et al. Apr 2003 A1
20030072364 Kim et al. Apr 2003 A1
20030108247 Acharya Jun 2003 A1
20030123539 Kim et al. Jul 2003 A1
20030126169 Wang et al. Jul 2003 A1
20030174077 Acharya et al. Sep 2003 A1
20030194008 Acharya et al. Oct 2003 A1
20030194128 Acharya et al. Oct 2003 A1
20030210164 Acharya et al. Nov 2003 A1
20040017952 Acharya et al. Jan 2004 A1
20040022433 Acharya et al. Feb 2004 A1
20040042551 Acharya et al. Mar 2004 A1
20040047422 Acharya et al. Mar 2004 A1
20040057516 Kim et al. Mar 2004 A1
20040057626 Acharya et al. Mar 2004 A1
20040071350 Acharya et al. Apr 2004 A1
20040080513 Acharya Apr 2004 A1
20040146208 Pazmino et al. Jul 2004 A1
20040158594 Acharya Aug 2004 A1
20040169748 Acharya Sep 2004 A1
20040169749 Acharya Sep 2004 A1
20040172433 Acharya et al. Sep 2004 A1
20040174446 Acharya Sep 2004 A1
20040240714 Acharya et al. Dec 2004 A1
Foreign Referenced Citations (1)
Number Date Country
0 907 288 Apr 1999 EP
Related Publications (1)
Number Date Country
20030210164 A1 Nov 2003 US
Continuations (1)
Number Date Country
Parent 09704392 Oct 2000 US
Child 10454553 US