The invention generally relates to structured data documents and more particularly to representations of metadata schemas.
Structured data documents are used for a wide variety of purposes, including, by way of example, databases, electronic commerce, graphics, and multimedia. Some examples of structured data documents include HTML (HyperText Markup Language) documents, XML (eXtensible Markup Language) documents, scalable vector graphics (SVG) files, MP3 audio files, and MPEG-7 multimedia files.
A common feature of structured data documents is the use of schemas to describe the structure, content, and/or semantics of the documents. For example, XML documents may carry just about any kind of data. XML allows the author of a document to define his or her own tags and document structure. An XML schema, on the other hand, defines the legal building blocks of an XML document such as the elements or attributes that can appear in a document, relationships between the elements of a document, the data types of elements and attributes, and default values for elements and attributes. XML schemas are typically written in XML and support data types and namespaces. An XML schema can be reused in other schemas. It is possible to reference multiple XML schemas from a single document.
XML schemas are typically defined in plain text format and thus provide a generally software- and hardware-independent way of communicating data. The use of plain text format, however, typically means that XML documents and their related schema require significant memory and bandwidth for transmission. Additionally, because schema elements are only syntactically organized, the entire schema generally must be parsed before any part of the schema can be used, requiring significant processing time and power on the receiving end.
A compressed binary schema representation object for metadata processing and a related method are provided. The object and method provide substantial savings with respect to bandwidth requirements for transmission and processing requirements for decoding a structured data document. These teachings are presented below in the context of an XML (eXtensible Markup Language) document. Those skilled in the art will readily recognize, however, that these teachings are generally applicable to any structured data document.
An advantage of the disclosed binary schema format for metadata processing is the high compression ratio possible relative to textual formats. The binary format described below typically compresses schema size to less than 20% of the original size of the textual format. This high compression ratio significantly reduces network bandwidth requirements during transmission, storage size requirements at the receiving end, and corresponding processing during decoding. These benefits are especially helpful in mobile environments.
The disclosed binary schema format makes use of two generic data structures to implement two fundamental data types commonly used in structured data documents in general, and in XML schemas in particular. A schema is decomposed into a sequence of data structures of complex type and simple type, one after another in a linear fashion.
Another advantage of the disclosed binary schema format is that the basic data structures in the schema can be in an arbitrary sequence. An entry table is provided at the start of the data stream or file to link all of the data structures in the stream or file. The entry table acts at least to some extent like a lookup table that stores the size and offset of each data structure in the stream or file. Inside each complex type data structure, each child element has an index number assigned to it, providing a way to quickly traverse the entire schema tree without performing any searching operation. Additionally, the schema type information is purposely placed at the start of each data structure so that any type matching and verification is facilitated. A decoder using a binary schema object can easily locate the relevant schema fragment and load it into memory prior to using it for selectively decoding a portion of a metadata stream. The same decoder using a textual schema would have to load the entire textual schema and perform full validation and parsing against an incoming instance of the schema.
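By way of illustration only, the lookup role of the entry table can be sketched as follows. The field widths are assumptions made for this sketch (a 2-byte entry count and 4-byte big-endian offset and size fields); the description does not fix them here, and the helper names `build_stream` and `read_entry` are hypothetical.

```python
import struct

# Assumed layout (not taken from the description): a 2-byte entry count,
# then one (offset, size) pair of 4-byte big-endian integers per entry.
COUNT = struct.Struct(">H")
ENTRY = struct.Struct(">II")

def build_stream(structures):
    """Place an entry table of (offset, size) pairs ahead of the data structures."""
    header_len = COUNT.size + ENTRY.size * len(structures)
    table = bytearray(COUNT.pack(len(structures)))
    offset = header_len
    for blob in structures:
        table += ENTRY.pack(offset, len(blob))
        offset += len(blob)
    return bytes(table) + b"".join(structures)

def read_entry(stream, index):
    """Fetch one data structure by index without scanning the others."""
    (count,) = COUNT.unpack_from(stream, 0)
    if not 0 <= index < count:
        raise IndexError(index)
    offset, size = ENTRY.unpack_from(stream, COUNT.size + ENTRY.size * index)
    return stream[offset:offset + size]

# Placeholder payloads standing in for a namespace, a root table, and a type.
stream = build_stream([b"urn:example", b"\x01\x02", b"complex-type-bytes"])
fragment = read_entry(stream, 2)
```

Because each lookup reads only the count and one table entry, the cost of locating a schema fragment does not depend on how many other data structures precede it in the stream.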
Another advantage of the disclosed binary schema format is the incorporation of the schema namespace into the binary schema format. This allows a current binary schema table to reference external schema types. A flag consisting of a single bit may be used to indicate whether a schema type is external, in which case the corresponding namespace will follow. The use of this flag eliminates the need to specify the namespace for every internal schema type, thus providing further efficiency.
According to the teachings of this description, the textual-based XML schemas 102 are processed on the server side 100 by a binary schema formatter 106 which converts the XML schemas 102 to binary form 107 prior to transmission. In binary form, all schema elements are linked as a lookup table as described below and compressed individually into an efficient binary structure. An MPEG-7 BiM encoder 108 can use the binary schemas directly and the compressed binary schemas 107 can be efficiently transferred over a network to the client side 103. On the client side 103, an MPEG-7 BiM decoder 108 can dynamically reconfigure itself with the received binary schemas 109 that may be stored in a binary schema cache 110. Because the binary schemas 107 have been pre-processed on the server side 100, they do not require an XML parser on the client side 103, thus saving processing time and power on the client side 103. Those skilled in the art will appreciate that these savings, both in terms of reduced bandwidth requirements for the transmission and processing requirements at the client, make these teachings especially suitable for mobile environments.
Alternatively, the binary schemas 109 could be generated off-line and loaded on the client side 103 for selection by either the server or some automated or semi-automated method. In the case where the schema is selected by a server, the specific binary schema ID can be conveyed by the BiM stream 104.
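A client-side binary schema cache of the kind referred to above might, as one possibility, be keyed by the binary schema ID conveyed in the BiM stream. The following is a minimal sketch under that assumption; the class name `BinarySchemaCache`, the `fetch` callback, and the sample ID are hypothetical and do not appear in the description.

```python
class BinarySchemaCache:
    """Keeps received binary schemas so each one is obtained only once."""

    def __init__(self, fetch):
        self._fetch = fetch      # hypothetical callable: schema ID -> bytes
        self._schemas = {}

    def get(self, schema_id):
        # Obtain the schema only on the first request for this ID.
        if schema_id not in self._schemas:
            self._schemas[schema_id] = self._fetch(schema_id)
        return self._schemas[schema_id]

requests = []

def fake_fetch(schema_id):
    """Stand-in for a network transfer or an off-line load."""
    requests.append(schema_id)
    return b"binary-schema:" + schema_id.encode("utf-8")

cache = BinarySchemaCache(fake_fetch)
first = cache.get("schema-a")
second = cache.get("schema-a")   # served from the cache
```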
A given XML schema can be decomposed into a sequence comprised of two basic data types, simple types (
The binary schema encoding process 300 in this illustrative embodiment follows the flow chart of
Referring now to
As shown in
The namespace 402 is a special data structure following the schema entry table 401. It specifies the namespace for the binary schema table 400. The namespace may be coded as a character string, whose size and offset are specified in entry 0 of the schema entry table 401.
Even though the schema types are arranged sequentially inside the schema table 400, they are all linked together to form a reversed tree-like structure. At the top of the structure, there are one or more roots. These roots are known as the global entries of the schema table and serve as entry points to the schema table. The root table 403 is the second data structure after the schema entry table 401. An illustrative binary encoding example of a root table 403 is shown in Table 1 below. It starts with the number of root entries in the schema table, followed by the index number of each root entry referencing the entry in the schema entry table. The root table 403 also includes the names of the global elements.
Note that vluimsbf8 designates a variable-length unsigned integer code, most significant bit first. The size of a vluimsbf8 code word is a multiple of one byte. The first bit of each byte, if set to 1, specifies that another byte follows in this vluimsbf8 code word. The unsigned integer is encoded by concatenating the seven least significant bits of each byte belonging to the code word.
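The vluimsbf8 rule just stated can be illustrated with a short sketch. The function names are hypothetical; the bit layout follows the definition above (a continuation bit in the first bit of each byte, seven payload bits per byte, most significant group first).

```python
def encode_vluimsbf8(value):
    """Encode a non-negative integer as a vluimsbf8 code word."""
    if value < 0:
        raise ValueError("vluimsbf8 encodes unsigned integers only")
    groups = [value & 0x7F]
    value >>= 7
    while value:
        groups.append(value & 0x7F)
        value >>= 7
    groups.reverse()                      # most significant 7-bit group first
    last = len(groups) - 1
    # Set the first bit of every byte except the last to signal continuation.
    return bytes((g | 0x80) if i < last else g for i, g in enumerate(groups))

def decode_vluimsbf8(data, pos=0):
    """Decode one code word starting at pos; return (value, next position)."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:               # first bit clear: last byte
            return value, pos

encoded = encode_vluimsbf8(300)           # two bytes: 0x82 0x2C
decoded, next_pos = decode_vluimsbf8(encoded)
```

Values from 0 through 127 occupy a single byte, so small counts and index numbers cost no more than a fixed one-byte field.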
As noted earlier, a complex type element is depicted in
An illustrative example for binary encoding of a complex type header is shown in Table 2 below.
A complex type child element starts with an element count 212 followed by a sequence of element units 213. An illustrative example of binary encoding of an element unit is depicted in Table 3 below. The values of minOccurs and maxOccurs are encoded using 7 bits each and, thus, can accommodate 0 through 126 occurrences. When all 7 bits of maxOccurs are set to 1, maxOccurs is understood to be “unbounded.”
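The 7-bit occurrence fields can be illustrated as follows. This is a sketch only: placing minOccurs in the upper 7 bits of a 14-bit field is an assumption made for the example, and `None` is used here to stand for "unbounded".

```python
UNBOUNDED = 0x7F   # all seven bits set to 1

def encode_occurs(n):
    """Map an occurrence count (or None for unbounded) to a 7-bit value."""
    if n is None:
        return UNBOUNDED
    if not 0 <= n <= 126:
        raise ValueError("a 7-bit occurrence field holds 0 through 126")
    return n

def decode_occurs(bits):
    """Map a 7-bit value back to a count, or None when unbounded."""
    bits &= 0x7F
    return None if bits == UNBOUNDED else bits

def pack_occurs(min_occurs, max_occurs):
    # Assumed order for this sketch: minOccurs first, then maxOccurs.
    if min_occurs is None:
        raise ValueError("minOccurs cannot be unbounded")
    return (encode_occurs(min_occurs) << 7) | encode_occurs(max_occurs)

def unpack_occurs(field):
    return decode_occurs(field >> 7), decode_occurs(field)

field = pack_occurs(1, None)      # minOccurs = 1, maxOccurs = unbounded
limits = unpack_occurs(field)
```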
When the simple-element flag is set, the element carries inline content only. In the case that the inline content is of primitive type, the external flag will be reset and the primitive-type flag will be set.
The location of an element type is found via the entry index in the same schema table if the external flag is not set, or in a different table, whose namespace follows, if the external flag is set. The use of the entry index allows an application to quickly get the child element type from the same schema table without performing string matching.
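The two lookup paths can be sketched as below. The names `TypeRef` and `locate` are hypothetical, as are the sample namespace and type names. How a particular type is then selected inside an external table is not described in this passage, so for an external reference the sketch returns the external table itself.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class TypeRef:
    external: bool
    entry_index: Optional[int] = None   # meaningful when external is False
    namespace: Optional[str] = None     # meaningful when external is True

def locate(ref, local_entries, external_tables):
    """Return the local type for an internal reference, or the table
    identified by the namespace for an external reference."""
    if not ref.external:
        # Direct indexing: no string comparison is needed.
        return local_entries[ref.entry_index]
    return external_tables[ref.namespace]

# Hypothetical data for illustration only.
local_entries = ["<namespace>", "<root table>", "TitleType", "CreatorType"]
external_tables = {"urn:example:other": ["<namespace>", "<root table>", "DateType"]}

internal = locate(TypeRef(external=False, entry_index=3),
                  local_entries, external_tables)
foreign = locate(TypeRef(external=True, namespace="urn:example:other"),
                 local_entries, external_tables)
```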
An attribute group consists of an attribute count followed by one or more attribute units. An illustrative example of binary encoding for an attribute unit is shown in Table 4 below.
Each attribute unit contains three key flags: a use flag, a default flag, and a fixed flag. The use flag is encoded in 2 bits: 0b00 indicates optional; 0b01 indicates required; 0b10 indicates prohibited; and 0b11 is reserved. The default flag indicates whether a default value will be specified for the attribute. The fixed flag indicates whether a fixed value will be specified for the attribute.
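The three flags might be packed and unpacked as in the following sketch. The 2-bit use codes follow the description; the relative bit order of the three flags within the 4-bit field is an assumption made for illustration, and the function names are hypothetical.

```python
from enum import IntEnum

class Use(IntEnum):
    OPTIONAL = 0b00
    REQUIRED = 0b01
    PROHIBITED = 0b10
    # 0b11 is reserved and deliberately has no member.

def pack_attribute_flags(use, has_default, has_fixed):
    # Assumed order for this sketch: use (2 bits), default (1 bit), fixed (1 bit).
    return (int(use) << 2) | (int(bool(has_default)) << 1) | int(bool(has_fixed))

def unpack_attribute_flags(bits):
    use_bits = (bits >> 2) & 0b11
    if use_bits == 0b11:
        raise ValueError("use value 0b11 is reserved")
    return Use(use_bits), bool(bits & 0b10), bool(bits & 0b01)

packed = pack_attribute_flags(Use.REQUIRED, has_default=True, has_fixed=False)
use, has_default, has_fixed = unpack_attribute_flags(packed)
```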
The attribute unit also contains an external flag that indicates the location of the attribute type. If it is external, a namespace will be specified for the external schema. If it is internal, an entry index will be given to locate the corresponding type in the same schema table.
If the attribute is a primitive type, the primitive flag will be set and the name of the type will be specified by the value of attribute type. Because it is assumed that the BiM decoder has knowledge of all primitive types, the schema table in this embodiment need not include any information for the primitive types.
A type cast group consists of a type cast count followed by a group of type cast units. An illustrative example of binary encoding for type cast units is shown in Table 5 below.
Simple types, one of the two basic data types, are depicted in
An illustrative example of binary encoding for a simple type unit is shown in Table 7 below.
By one approach, a simple type belongs to one of three categories, known as group types. In this illustrative example a group type is encoded in 2 bits: 0b00 indicates atomic; 0b01 indicates union; 0b10 indicates list; and 0b11 is reserved. The facet flag indicates whether a facet is specified for the simple type. The union flag indicates whether a member type array is specified for the simple type. The list flag indicates whether an item type is specified for the simple type.
A string array carries an array of character strings. An illustrative example of binary encoding for a string array is shown in Table 8 below. A string array can be used for a member type array in a simple type unit, or for an enumeration array in a facet.
A facet 204 may be used to specify restrictions for a simple type unit 202. An illustrative example of binary encoding for a facet 204 is shown in Table 9 below. In this example the value of white space is encoded in 2 bits: 0b00 indicates preserve; 0b01 indicates replace; 0b10 indicates collapse; and 0b11 is reserved.
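For illustration, a decoder might map the 2-bit white space code to the corresponding normalization as sketched below. The codes follow the description; the normalization rules are those of the XML Schema whiteSpace facet (replace turns each tab, line feed, and carriage return into a space; collapse additionally merges runs of spaces and trims both ends). The function names are hypothetical.

```python
import re
from enum import IntEnum

class WhiteSpace(IntEnum):
    PRESERVE = 0b00
    REPLACE = 0b01
    COLLAPSE = 0b10

def decode_white_space(bits):
    """Interpret the 2-bit white space code; 0b11 is reserved."""
    bits &= 0b11
    if bits == 0b11:
        raise ValueError("white space value 0b11 is reserved")
    return WhiteSpace(bits)

def normalize(text, mode):
    """Apply the whiteSpace facet semantics to a character string."""
    if mode is WhiteSpace.PRESERVE:
        return text
    replaced = re.sub(r"[\t\n\r]", " ", text)
    if mode is WhiteSpace.REPLACE:
        return replaced
    return re.sub(r" +", " ", replaced).strip(" ")

collapsed = normalize(" a\t\tb \n", decode_white_space(0b10))
```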
An illustrative method of using a compressed schema representation object such as the one described herein is depicted in
One example of a kind of structured data document that might be thus provided is an XML document. The types 507 could then comprise complex types and simple types as described above. Further, the relational links could define a hierarchical relationship for the plurality of types.
It can be desirable in some circumstances, for example in mobile environments, to compress the non-hierarchically ordered elements before transmission. The reduction in size resulting from compression provides efficiencies in bandwidth usage and in processing time by the receiver. Further efficiencies can be achieved if at least some of the elements are individually compressed, thus enabling individual selection and decompression of the elements.
An illustrative example of a method 600 of using a schema object, such as the one described herein, for receiving information is depicted in
A record of at least some of the received elements is made 606 at the receiving end. A desired schema formatting type is identified 607, and the elements that correspond to the desired schema formatting type are recovered 608. If the elements are in a compressed format, the recovery process 608 may comprise querying the schema 609 and decompressing the desired schema formatting type 610. The desired schema formatting type can optionally be decompressed separately from others of the received elements.
A relational link corresponding to the recovered element may be used to identify at least another one of the elements to be automatically recovered 611. If a structured data document was received 605, it may be processed 612 using schema formatting information recovered 608.
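The recovery steps above can be sketched as follows: each element is compressed on its own, only the desired element is decompressed, and its relational links identify the further elements to recover automatically. The use of zlib and JSON is an assumption made purely so the sketch is runnable; the description does not name a compression method or payload format, and the element names are hypothetical.

```python
import json
import zlib

def make_store(types):
    """Compress each element individually.
    types maps an index to (name, list of linked child indices)."""
    return {
        index: zlib.compress(
            json.dumps({"name": name, "children": children}).encode("utf-8"))
        for index, (name, children) in types.items()
    }

def recover(store, index):
    """Decompress the desired element, then follow its relational links."""
    recovered = {}
    pending = [index]
    while pending:
        current = pending.pop()
        if current in recovered:
            continue                      # already recovered via another link
        element = json.loads(zlib.decompress(store[current]))
        recovered[current] = element
        pending.extend(element["children"])
    return recovered

store = make_store({
    1: ("Root", [2, 3]),
    2: ("Title", []),
    3: ("Creator", [2]),
    4: ("Unused", []),
})
needed = recover(store, 3)   # element 3 and the element it links to
```

Elements that are not reachable from the desired type, such as element 4 here, are never decompressed.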
An illustrative example of an apparatus for receiving and processing a schema representation object is depicted in
A processor 707 is operably coupled to the memory 702. The processor is configured and arranged to: identify a desired schema formatting type 705; recover a given one of the plurality of substantially non-hierarchically ordered elements 704 that corresponds to the desired schema formatting type 705; and use at least one relational link 706 to identify at least one other one of the plurality of substantially non-hierarchically ordered elements 704 to be automatically recovered.
The teachings herein provide several technical benefits in the field of metadata processing. A binary schema object will generally be more efficient than its textual counterpart. Even in situations where the sizes of the textual and binary schemas may be comparable, a binary schema object should outperform its textual counterpart in operation. A decoder using a binary schema object can locate the relevant schema fragment and load it into memory prior to using it for selectively decoding a portion of a metadata bit stream. The same decoder, using a textual schema, would have to load the entire textual schema and perform full validation and parsing against an incoming instance of the schema, consuming considerably more time, memory, and processing power than with the equivalent binary schema.
An important benefit derived from the teachings herein is compatibility between binary metadata schema objects used in binary metadata decoding. The binary metadata schema object is a binary format that is independent of the client's software. Therefore, it can be understood by any client capable of reading the format. In terms of schema compatibility, if a device finds itself operating in the presence of more than one schema, it can simply switch to a different one of the several binary metadata schema objects available, which may be held in memory, held in secondary storage, or received from a remote server during decoder setup. The binary metadata schema object can also guarantee compatibility between data models in a content management system, since the schema database can be dynamically configured to match the schema being used.
The benefits of the teachings herein are especially important in mobile communications scenarios. Since binary metadata decoding relies heavily on schema information, defining an efficient binary schema format that directly serves the requirements of a binary metadata decoder removes the most important bottleneck toward efficient operation.
The foregoing relates to exemplary embodiments of the invention. It is understood that other embodiments and variants are possible which lie within the spirit and scope of the invention as set forth in the following claims.