The present invention relates generally to data compression techniques and, more particularly, to techniques for compactly encoding mark-up language documents.
Mark-up languages, such as Hypertext Mark-up Language (HTML) and Extensible Mark-up Language (XML), have been in widespread use for the past several years. Mark-up languages allow software developers to create documents that include a variety of data items, such as text, logos, pictures, and sounds, which can then be rendered by various types of programs, such as web browsers. Mark-up languages use special notations, referred to as tags, to identify data items, and to indicate how the data items are to be processed. These tags also allow computer programs, such as parsers and web browsers, to search, sort, identify and extract data from the document. While mark-up languages make the use and interchange of data easier and more user-configurable, the addition of tags along with the data substantially increases the size of data files. This increase in file size or “bloat” can be considerable, and creates problems when data has to be transmitted quickly or stored compactly.
In accordance with the foregoing, a method and system for encoding a mark-up language document is provided. According to the invention, the structure of the mark-up language document is condensed by removing those parts of the structure that are fixed, and by expressing the variable parts of the structure in terms of which elements occur, whether elements occur, or how often certain elements occur. This may involve separating the structure of the mark-up language document from its content, and treating the structure and content differently. To encode a block of mark-up language text according to an embodiment of the invention, a template is used to determine which of the elements of the block have a fixed number of occurrences and which of the elements have a variable number of occurrences. The structure of the block is represented with a compact block of text that expresses the number of occurrences of the elements that have a variable number of occurrences, but that does not contain information regarding the elements that have a fixed number of occurrences. In various embodiments of the invention, the content of the mark-up language document is, itself, compressed by grouping similar or related data items together.
Additional features and advantages of the invention will be made apparent from the following detailed description of illustrative embodiments that proceeds with reference to the accompanying figures.
While the appended claims set forth the features of the present invention with particularity, the invention may be best understood from the following detailed description taken in conjunction with the accompanying drawings of which:
Prior to proceeding with a description of the various embodiments of the invention, a description of the computer and networking environment in which various embodiments of the invention may be practiced will now be provided. Although it is not required, the present invention may be implemented by program modules that are executed by a computer. Generally, program modules include routines, objects, components, data structures and the like that perform particular tasks or implement particular abstract data types. The term “program” as used herein may connote a single program module or multiple program modules acting in concert. The invention may be implemented on a variety of types of computers. Accordingly, the terms “device” and “computer” as used herein include personal computers (PCs), hand-held devices, multi-processor systems, microprocessor-based programmable consumer electronics, network PCs, minicomputers, mainframe computers and the like. The invention may also be employed in distributed computing environments, where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, modules may be located in both local and remote memory storage devices.
An example of a networked environment in which the invention may be used will now be described with reference to
Referring to
Computer 100 may also contain communications connections that allow the device to communicate with other devices. A communication connection is an example of a communication medium. Communication media typically embody computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and include any information delivery media. By way of example, and not limitation, communication media include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. The term computer readable media as used herein includes both storage media and communication media.
Computer 100 may also have input devices such as a keyboard, mouse, pen, voice input device, touch input device, etc. Output devices such as a display 118, speakers, a printer, etc. may also be included. All these devices are well known in the art and need not be discussed at length here.
As used herein, the term “mark-up language” refers to any computer interpretable language that describes the structure of a document. Examples of mark-up languages include Standard Generalized Mark-up Language (SGML) and all of its variants, including Hypertext Mark-up Language (HTML), Extensible Mark-up Language (XML) and Extensible Style Sheet Language (XSL). Furthermore, the term “mark-up language document” refers to any document that contains mark-up language. Finally, the phrase “block of mark-up language” is used to indicate a mark-up language document or a portion thereof.
The invention is generally directed to a method and system for encoding a mark-up language document. According to the invention, the structure of the mark-up language document is condensed by removing those parts of the structure that are fixed, and by expressing the variable parts of the structure in terms of which elements occur, whether elements occur, or how often certain elements occur. In various embodiments of the invention, other steps may be taken to further reduce the size of the mark-up language document. These steps include one or more of: separating the data items of the document from its structure, grouping together data items with related meaning or data-type, and encoding data items in their native format.
According to various embodiments of the invention, both the sender of the mark-up language document and the receiver of the mark-up language document possess a common template that defines a default structure for a document type. This default structure includes fixed elements, which are those elements that are always present in a fixed number of occurrences in a mark-up language document of that type. The default structure also includes variable elements, which are those elements that may or may not be present in a mark-up language document of that type, or that are always present but may have any number of occurrences.
For example, the mark-up language document shown in Table 1 describes a “books” element, which describes a set of books:
The document of Table 1 is written using the Extensible Mark-up Language (XML). Within the “books” element are “book” elements, each of which describes a book. Each “book” element includes a “title” element, an “authors” element, a “price” element and, optionally, an “ISBN” element. The “title” element includes the title of the book. The “authors” element describes all of the authors of the book. The “authors” element includes one or more “author” elements. Each “author” element has a “firstName” element and a “lastName” element. The “firstName” element contains the first name of the author, while the “lastName” element contains the last name of the author. Each element of the document of Table 1 is bounded by a pair of tags. For example, the first “title” element in Table 1 is <title>Wireless & Networking—I</title>. The data contained in the element is Wireless & Networking—I. The element is bounded by the tags <title> and </title>. Collectively, all of the tags of the mark-up language document of Table 1 constitute the structure of the document.
Referring to Table 2, an example of a template that may be used for the mark-up language document of Table 1 according to an embodiment of the invention will now be described. The template is formatted as an XML Document Type Definition (DTD):
As an alternative, the template may be formatted as a schema, as shown in Table 3:
There are many other possible ways to format the template. For example, other mark-up languages or programming languages besides those shown herein may be used.
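The meaning of some of the labels and terms of the DTD document of Table 2 will now be described. The label “#PCDATA” signifies parsed character data (plain text, for example). The symbols ?, + and * are used as follows: ? indicates that an element occurs zero or one time, + indicates that an element occurs one or more times, and * indicates that an element occurs zero or more times.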
According to an embodiment of the invention, a sender that wishes to send a mark-up language document to a receiver removes all of the tags that are associated with elements that are fixed, that is, those elements that the sender and receiver both know will always be in the document and will always appear in a fixed number of instances. Furthermore, the sender need not send the tags associated with variable elements, but only needs to send information indicating how many instances of each variable element exist in the mark-up language document being sent. For example, if a sender and receiver each have a copy of the DTD document of Table 2, and they each understand that the mark-up language document that is to be sent conforms to the DTD document of Table 2, then the entire structure of the mark-up language document of Table 1 (that is, the tags) can be represented as
1(integer) 1(single bit), 2(integer) 0(single bit),
where the first number in each pair is the number of occurrences of the “author” element and the second number is a single bit that is set high if there is an “ISBN” element and low if there is not.
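The pair-per-book representation described above can be illustrated with a minimal Python sketch. The document text below is a hypothetical stand-in shaped like the example (one author and an ISBN for the first book, two authors and no ISBN for the second); the titles, names, prices, and the function name `encode_structure` are assumptions made for illustration and are not taken from Table 1.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with the same shape as the "books" example.
# Only the element names come from the description; the values are made up.
DOC = """
<books>
  <book>
    <title>Book One</title>
    <authors>
      <author><firstName>Peter</firstName><lastName>K.</lastName></author>
    </authors>
    <price>10.00</price>
    <ISBN>0000000000</ISBN>
  </book>
  <book>
    <title>Book Two</title>
    <authors>
      <author><firstName>Peter</firstName><lastName>K.</lastName></author>
      <author><firstName>Jane</firstName><lastName>D.</lastName></author>
    </authors>
    <price>20.00</price>
  </book>
</books>
"""

def encode_structure(xml_text):
    """Return one (author_count, has_isbn_bit) pair per "book" element.

    Tags for fixed elements (title, authors, price, firstName, lastName)
    are not recorded, because the shared template already implies them.
    """
    root = ET.fromstring(xml_text.strip())
    pairs = []
    for book in root.findall("book"):
        author_count = len(book.findall("authors/author"))
        has_isbn = 1 if book.find("ISBN") is not None else 0
        pairs.append((author_count, has_isbn))
    return pairs

structure = encode_structure(DOC)
print(structure)  # [(1, 1), (2, 0)]
```

In this sketch the number of “book” elements is carried implicitly by the number of pairs; an implementation could equally send that count as a leading integer.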
According to some embodiments of the invention, the data items of a mark-up language document and the structure of the mark-up language document are separated from one another. Separating the data items from the structure allows each to be processed separately. Processing the structure (the tags, for example) involves techniques such as using a template, as described above. Processing the data items (the text between pairs of tags, for example) involves techniques such as grouping data items with related meaning or type, and encoding data items in their native format.
To further illustrate the invention, an example of how the mark-up language document of Table 1 is processed according to an embodiment of the invention will now be described. First, the structure of the document (the tags, in this case) is separated from the content (the data items between the tags, in this case), as shown in Table 4:
Next, those data items that are of the same type or of similar type are grouped together, as shown in Table 5:
Note that, in this example, the titles are grouped together, the author first names are grouped together, the author last names are grouped together, and the prices are grouped together. Grouping together like data items in this manner allows the data items to be compressed more efficiently. One reason for this is that there are more likely to be duplicate data items next to one another when the data items are grouped by type. For example, Peter K. was an author of both of the books described in the mark-up language document of Table 1. After grouping like data types together (Table 5), the data item “Peter” now occurs twice consecutively. Many existing compression tools, such as GZip, can take advantage of repeated terms that are proximate to one another by replacing a repeated string with a short reference to its earlier occurrence. Furthermore, the data items are encoded in their native format. For example, floating point numbers are encoded as floating point numbers, even though they may have been treated as characters in the mark-up language document. Similarly, integers are encoded as integers, characters are encoded as characters, and so on.
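As a rough illustration of grouping like data items and encoding them in a native format, the following Python sketch groups hypothetical book data by type and packs the prices as binary floating point values. The data values, the function names, and the choice of a little-endian 32-bit float layout are all assumptions made for this example.

```python
import struct

# Hypothetical content in document order; values are illustrative only.
books = [
    {"title": "Book One", "authors": [("Peter", "K.")], "price": "10.00"},
    {"title": "Book Two", "authors": [("Peter", "K."), ("Jane", "D.")],
     "price": "20.00"},
]

def group_by_type(items):
    """Collect data items of the same kind into adjacent runs."""
    titles = [b["title"] for b in items]
    first_names = [first for b in items for first, _ in b["authors"]]
    last_names = [last for b in items for _, last in b["authors"]]
    # Prices arrive as text in the document; convert them to numbers here.
    prices = [float(b["price"]) for b in items]
    return titles, first_names, last_names, prices

def encode_prices(prices):
    """Pack prices as little-endian 32-bit floats (an assumed layout)."""
    return struct.pack("<%df" % len(prices), *prices)

titles, first_names, last_names, prices = group_by_type(books)
print(first_names)                 # ['Peter', 'Peter', 'Jane']
print(len(encode_prices(prices)))  # 8 bytes for two prices
```

After grouping, the two occurrences of “Peter” sit next to each other, which is the situation the paragraph above describes as favorable for a general purpose compressor.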
The structure of the mark-up language document is then processed according to a template, as described above. All of those tags that are fixed—always in the document in a fixed number of instances—are removed. All of the tags that are variable are represented by the number of occurrences of the elements that they represent:
In accordance with various embodiments of the invention, the structure of a mark-up language document and the content of the mark-up language document may be compressed separately and compressed according to different compression schemes. Additionally, certain types of content may be compressed according to different compression schemes. For example, to achieve a higher compression ratio from certain applications (that use certain data types), special purpose compressors can be used. Some types of data, such as integer data or calendar dates, can be encoded in binary. In the previous example, the “price” data items could be encoded using integers. Differential or delta encoding is useful for numeric data in which there is little variation. For example, 10200, 10240, 10185, 10182, . . . would be encoded as 10200, +40, −55, −3, . . . More complex compression schemes can be applied to a variety of specialized data types, such as images, sounds, DNA sequences, and so on. Finally, each separate component of the mark-up language document (whether it is the structure, the data items, or individual groups of data items) can be further compressed using a general purpose compression tool, such as GZip or bzip2.
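The differential encoding mentioned above can be sketched in a few lines of Python. The function names are hypothetical, and the fourth input value, 10182, is an assumption chosen so that the final difference is −3.

```python
def delta_encode(values):
    """Keep the first value, then store each value as a difference
    from its predecessor."""
    if not values:
        return []
    encoded = [values[0]]
    for previous, current in zip(values, values[1:]):
        encoded.append(current - previous)
    return encoded

def delta_decode(encoded):
    """Rebuild the original values by accumulating the differences."""
    values = []
    running = 0
    for index, item in enumerate(encoded):
        running = item if index == 0 else running + item
        values.append(running)
    return values

readings = [10200, 10240, 10185, 10182]
deltas = delta_encode(readings)
print(deltas)                # [10200, 40, -55, -3]
print(delta_decode(deltas))  # [10200, 10240, 10185, 10182]
```

The small differences need fewer bits than the original values, which is why this scheme suits numeric data with little variation.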
There are many different ways in which the various embodiments of the invention described herein may be implemented. Referring to
At step 164, the sending device 150 obtains an XML document 157 and the XML profile generated at step 160, separates the structure (the tags of the XML document) from the content (the data items between sets of tags), encodes the structure of the XML document 157 according to the profile, and optimizes the content. This encoding and optimizing process involves one or more of the techniques previously discussed in conjunction with Tables 4, 5, and 6. At step 166, the sending device 150 compresses the encoded structure and the data content according to a well-known compression scheme (GZip or bzip2, in this example). The sending device 150 then creates a message 168 that includes the compressed and encoded structure, the compressed content, and data that identifies the profile that is to be used by the receiving device 152 when it interprets the message 168. The sending device 150 then transmits the message 168 to the receiving device 152 over the network 151.
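One possible way to assemble such a message is sketched below in Python. The wire layout (a 16-bit profile identifier followed by two 32-bit lengths), the function names, and the sample byte strings are assumptions made for illustration; the description does not specify a message format.

```python
import gzip
import struct

HEADER = "<HII"  # assumed layout: profile id, structure length, content length

def build_message(profile_id, structure_bytes, content_bytes):
    """Compress structure and content separately, then prepend a header
    that identifies the profile and the size of each compressed part."""
    packed_structure = gzip.compress(structure_bytes)
    packed_content = gzip.compress(content_bytes)
    header = struct.pack(HEADER, profile_id,
                         len(packed_structure), len(packed_content))
    return header + packed_structure + packed_content

def parse_message(message):
    """Split a message produced by build_message back into its parts."""
    header_size = struct.calcsize(HEADER)
    profile_id, n_structure, n_content = struct.unpack(
        HEADER, message[:header_size])
    body = message[header_size:]
    structure_bytes = gzip.decompress(body[:n_structure])
    content_bytes = gzip.decompress(body[n_structure:n_structure + n_content])
    return profile_id, structure_bytes, content_bytes

# Hypothetical payloads: two (author count, ISBN bit) pairs and grouped text.
message = build_message(7, bytes([1, 1, 2, 0]), b"Peter\nPeter\nJane\n")
print(parse_message(message))
```

Keeping the structure and the content in separately compressed parts allows each to use a different compressor, as the description contemplates.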
Like the sending device 150, the receiving device 152 performs certain tasks off-line, which are represented by block 170, and performs certain tasks on-line, which are represented by block 172. During an off-line period, the receiving device 152 receives the template 155 at step 174 from the sending device 150. At step 176, the receiving device 152 then parses the template 155 to build a tree that represents an XML tag structure. The tree built by the receiving device 152 should look like the one built by the sending device 150 at step 160. At step 178, the receiving device 152 then generates an XML profile based on the tree.
During an on-line period, the receiving device 152 receives the message 168 and decompresses it at step 180. At step 182, the receiving device 152 decodes the structure portion of the message 168 using the tree created at step 176. This decoding may involve establishing a default tag structure based on the fixed elements of the XML document, and filling in the correct number of instances of the tags representing the variable elements. Finally, at step 184, the receiving device 152 reconstructs the XML document 157 by inserting the data items between the correct sets of tags generated at step 182.
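The decoding and reconstruction steps can be illustrated with the following Python sketch, which hard-codes the “books” template rather than parsing a DTD. The function name `reconstruct`, the grouped input lists, and all data values are hypothetical.

```python
import xml.etree.ElementTree as ET

def reconstruct(structure, titles, first_names, last_names, prices, isbns):
    """Rebuild a "books" document from (author_count, has_isbn) pairs and
    grouped data items. Fixed tags come from the hard-coded template;
    variable tags are emitted according to the decoded counts."""
    titles, prices, isbns = iter(titles), iter(prices), iter(isbns)
    first_names, last_names = iter(first_names), iter(last_names)
    root = ET.Element("books")
    for author_count, has_isbn in structure:
        book = ET.SubElement(root, "book")
        ET.SubElement(book, "title").text = next(titles)
        authors = ET.SubElement(book, "authors")
        for _ in range(author_count):
            author = ET.SubElement(authors, "author")
            ET.SubElement(author, "firstName").text = next(first_names)
            ET.SubElement(author, "lastName").text = next(last_names)
        ET.SubElement(book, "price").text = next(prices)
        if has_isbn:
            ET.SubElement(book, "ISBN").text = next(isbns)
    return root

# Hypothetical decoded inputs matching the two-book example.
rebuilt = reconstruct(
    structure=[(1, 1), (2, 0)],
    titles=["Book One", "Book Two"],
    first_names=["Peter", "Peter", "Jane"],
    last_names=["K.", "K.", "D."],
    prices=["10.00", "20.00"],
    isbns=["0000000000"],
)
print(ET.tostring(rebuilt, encoding="unicode"))
```

Because the data items were grouped by type in document order, consuming each group sequentially places every item between the correct pair of tags.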
It can thus be seen that a new and useful method and system for encoding a mark-up language document has been provided. In view of the many possible embodiments to which the principles of this invention may be applied, it should be recognized that the embodiments described herein with respect to the drawing figures are meant to be illustrative only and should not be taken as limiting the scope of the invention. For example, those of skill in the art will recognize that the elements of the illustrated embodiments shown in software may be implemented in hardware and vice versa or that the illustrated embodiments can be modified in arrangement and detail without departing from the spirit of the invention. Therefore, the invention as described herein contemplates all such embodiments as may come within the scope of the following claims and equivalents thereof.
Number | Date | Country |
---|---|---|
20040003343 A1 | Jan 2004 | US |