1. Field of the Invention
The present invention relates to the field of processing structured documents. More particularly, the present invention relates to a processing framework that enables generation of events corresponding to instance document elements and events corresponding to definition components in a single serial process.
2. Description of Prior Art
The extensible markup language (XML) is a standard used to organize and tag elements of a document so that the document can be transmitted and interpreted between applications and organizations. XML has become an increasingly popular standard for communicating documents via the World Wide Web because XML can be used to describe not only how to display information in a document, but also to describe the meaning and content of information in the document.
Software operable to process XML documents follows an XML schema associated with each document. An XML schema is a set of rules, or data model, to which an XML document must conform, and XML schemas can vary from document to document. A schema may identify, for example, types of data fields contained in a document and the relationship between those data fields.
Processing XML instance documents typically entails parsing a document according to the schema definition and generating events associated with various elements of the document. Such events may include a start event when an element is first encountered and an end event when the software encounters an end tag associated with the element.
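The start and end events described above can be illustrated with a minimal sketch that uses the SAX interface of the Python standard library. The class name EventRecorder and the function name record_events are hypothetical, and the standard-library parser is non-validating, so the sketch shows only the element events and does not consult a schema.

```python
import xml.sax


class EventRecorder(xml.sax.ContentHandler):
    """Collects a (kind, element name) tuple for every start and end tag."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # Called by the parser when an element's start tag is encountered.
        self.events.append(("start", name))

    def endElement(self, name):
        # Called by the parser when the matching end tag is encountered.
        self.events.append(("end", name))


def record_events(xml_text):
    """Parse xml_text and return the recorded element events in order."""
    handler = EventRecorder()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events


events = record_events("<order><item>pen</item></order>")
for kind, name in events:
    print(kind, name)
```

Every event in this sketch originates from the instance document alone, which is the limitation discussed in the following paragraph.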
Unfortunately, existing XML processors are limited in that they generate events relating only to information contained in the document, and do not provide the flexibility needed by users who wish to employ a more robust event model that extends beyond the contents of a particular XML instance document.
Accordingly, there is a need for an improved method of processing XML documents that does not suffer from the problems and limitations of the prior art.
The present invention provides a system and method of processing structured documents. More particularly, the present invention relates to a processing framework that enables generation of events corresponding to instance document elements and events corresponding to definition components in a single serial process.
A first embodiment of the invention is a computer program product comprising a computer usable medium including computer usable program code for processing a document that is structured according to a document definition.
The computer program product comprises computer usable program code for receiving event information corresponding to a first element of information within the document, for identifying a portion of the document definition describing the first element of information, and for generating a first event corresponding to the first element and for generating a second event corresponding to the portion of the document definition describing the first element, wherein the first event and the second event are generated as part of a single, serial process.
In a second embodiment, the computer program product comprises computer usable program code for identifying a first element of information within the document according to the document definition, for identifying a portion of the document definition describing the first element of information, and for generating a first event corresponding to the first element and a second event corresponding to the portion of the document definition describing the first element, wherein the first event and the second event are generated as part of a single, serial process.
In a third embodiment, the computer program product comprises computer usable program code for processing an XML schema and for processing event information relating to an instance XML document according to the XML schema. The computer usable program code for processing an XML schema includes code for identifying a first element declaration of an XML schema, for creating a graph data structure with a node corresponding to the first element declaration, for identifying a plurality of schema components (including model groups) and a second element declaration, wherein the plurality of schema components and the second element declaration relate to the first element declaration, for identifying a relationship between the first element declaration, the plurality of schema components, and the second element declaration, and for creating a node in the graph corresponding to each of the schema components and a node corresponding to the second element declaration, wherein the relationship is reflected in the graph.
The computer usable program code for processing event information relating to an instance XML document according to the XML schema includes code for receiving event information corresponding to a first element of an XML instance document corresponding to the first element declaration in the corresponding entity that defines each such element, for receiving event information corresponding to a second element of the XML instance document corresponding to the second element declaration, for identifying a path in the graph from the first element declaration to the second element declaration, wherein the path may include one or more of the plurality of nodes corresponding to the schema components, and for traversing the path, generating a start event corresponding to a child node when traversal is from a parent node to the child node, and for generating an end event corresponding to the child node when traversal is from the child node to the parent node.
These and other important aspects of the present invention are described more fully in the detailed description below.
An embodiment of the present invention is described in detail below with reference to the attached drawing figures, wherein:
The present invention provides a method for augmenting the functionality of a structured document processor, such as an XML parser, by generating events corresponding to definition components in addition to events generated by the document processor corresponding to document elements. The events corresponding to document elements and the events corresponding to definition components are generated in a single, serial process.
The method of the present invention is especially well-suited for implementation on a computer or computer network, such as the computer 10 illustrated in
The present invention can be implemented in hardware, software, firmware, or a combination thereof. In a preferred embodiment, however, the invention is implemented with a computer program. The computer program and equipment described herein are merely examples of a program and equipment that may be used to implement the present invention and may be replaced with other software and computer equipment without departing from the scope of the present invention.
The computer program of the present invention is stored in or on a computer-usable medium, such as a computer-readable medium, residing on or accessible by a host computer for instructing the host computer to implement the method of the present invention as described herein. The host computer may be a server computer, such as server computer 24, or a network client computer, such as computer 10. The computer program preferably comprises an ordered listing of executable instructions for implementing logical functions in the host computer and other computing devices coupled with the host computer. The computer program can be embodied in any computer-usable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device, and execute the instructions.
The ordered listing of executable instructions comprising the computer program of the present invention will hereinafter be referred to simply as “the program” or “the computer program.” It will be understood by those skilled in the art that the program may comprise a single list of executable instructions or two or more separate lists, and may be stored on a single computer-usable medium or multiple distinct media. The program may also be described as comprising various code segments, which may include one or more lists, or portions of lists, of executable instructions. Code segments may include overlapping lists of executable instructions—that is, a first code segment may include instruction lists A and B, and a second code segment may include instruction lists B and C.
In the context of this application, a “computer usable medium” or a “computer readable medium” can be any means that can contain, store, communicate, propagate or transport the program for use by or in connection with the instruction execution system, apparatus, or device. The computer-usable medium can be, for example, but is not limited to, an electronic, magnetic, optical, electro-magnetic, infrared, or semi-conductor system, apparatus, device, or propagation medium. More specific, although not inclusive, examples of the computer usable medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a random access memory (RAM), a read-only memory (ROM), an erasable, programmable, read-only memory (EPROM or Flash memory), an optical fiber, and a portable compact disk read-only memory (CDROM). The computer-usable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance, optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
As explained above in the section titled “DESCRIPTION OF THE PRIOR ART,” XML is a standard used to organize and tag elements of a document so that the document can be transmitted and interpreted between applications and organizations. Use of XML involves XML schemas and XML instance documents. An XML schema is a set of rules, or data model, to which an XML document must conform. An XML instance document is a data object conforming to, for example, the World Wide Web Consortium (“W3C”) recommendation at http://www.w3.org/TR/REC-xml or a previous version of that recommendation. XML instance documents and XML schemas are specific examples of structured documents and document definitions, respectively.
Examples of documents include, but are not limited to, an electronic file residing in a computer memory or a storage device, a collection of data elements received via an electronic communications medium, and a paper or other tangible medium containing printing or other indicia of data or information.
Various definition components specified by W3C recommendation http://www.w3.org/TR/xmlschema-1/ are illustrated in
A complex type has a component called a particle 40, which enforces cardinality constraints: minimum (“minOccurs”) and maximum (“maxOccurs”) occurrences of an element. The particle has a property called “term” 42, which can be a wildcard, an element declaration, or a model group. If the term is an element declaration, it can be either complex 36 or simple 38. The term may alternatively be a model group, which defines how content will be laid out: sequence, choice, or all. Items of a sequence model group must appear in a particular sequence, while any one item of a choice model group may appear. The content model “all” implies that all the items of the content model can appear in any order. The model group 44 may contain many particles, wherein each particle of the model group enforces cardinality constraints on an individual item of the model group through minOccurs and maxOccurs properties. This allows recursion of arbitrary depth among model groups, particles, and element declarations.
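The relationships among these components may be sketched as a small set of data classes. This is an illustrative model only: the class and field names (Particle, ModelGroup, min_occurs, and so on) and the use of None for an unbounded maxOccurs are assumptions of the sketch rather than names taken from the recommendation or from the program.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class ElementDeclaration:
    name: str
    # None stands for a simple type in this sketch.
    complex_type: Optional["ComplexType"] = None


@dataclass
class Wildcard:
    namespace: str = "##any"


@dataclass
class ModelGroup:
    compositor: str  # "sequence", "choice", or "all"
    particles: List["Particle"] = field(default_factory=list)


@dataclass
class Particle:
    # The term is an element declaration, a model group, or a wildcard.
    term: Union[ElementDeclaration, ModelGroup, Wildcard]
    min_occurs: int = 1
    max_occurs: Optional[int] = 1  # None stands for "unbounded" (assumption)

    def allows(self, count):
        """Return True if count satisfies the cardinality constraints."""
        upper_ok = self.max_occurs is None or count <= self.max_occurs
        return count >= self.min_occurs and upper_ok


@dataclass
class ComplexType:
    particle: Optional[Particle] = None


# An "order" element whose content is a sequence of one or more "item" elements.
item = Particle(ElementDeclaration("item"), min_occurs=1, max_occurs=None)
group = ModelGroup("sequence", [item])
order = ElementDeclaration("order", ComplexType(Particle(group)))
```

Because a model group holds particles and a particle may in turn hold a model group, the same classes can represent content models nested to any depth.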
An exemplary XML schema is illustrated in
In processing the schema, the program first loads an XML schema, as depicted in block 50. This step may involve, for example, reading a “.xsd” file or similar document containing an XML schema, such as the schema shown in
The program then identifies a definition component or an embedded child element declaration, as depicted in block 56. In this step, the program determines to which global element declaration each child element declaration is related, and to which (if any) element declaration each definition component is related. The program then creates a graph node corresponding to the definition component or child element declaration, as depicted in block 58. The node is placed in the graph to reflect a relationship between the definition component and other components and element declarations in the schema definition. For example, if the definition component is a particle component of a global element declaration, the node corresponding to the component is a child of the node corresponding to the global element declaration.
The program determines whether there are any more definition components or child element declarations to process, as depicted in block 60. If so, the program identifies the next definition component or child element declaration, as depicted in block 56. If not, the program determines whether there are any more global element declarations to process, as depicted in block 62. If so, the program identifies the next global element declaration, as depicted in block 52, and processes any components or child element declarations corresponding to the global element declaration. If there are no more global element declarations to process, the program has completed analyzing the XML schema and may use the graph corresponding to the definition to analyze an XML instance document that conforms to the XML schema definition.
The above-described method of processing an XML schema may be used or adapted to process virtually any structured document definition. However, the method will now be described more particularly in terms of the XML schema to illustrate one particular implementation. The program identifies a global element declaration, as explained above. If the global element is declared to be of complex type, a corresponding “Complex Type” node is created and attached to the global element node in the graph. The complex type definition is processed, whereby cardinality information (if the complex type has non-empty element content) and/or attribute information are discovered and a corresponding “Particle” node, “Attribute Use” node, or both are created and attached to the Complex Type node in the graph. The particle contains a model group, resulting in the creation of a “Model Group” node, attached to the Particle node. Processing of the model group leads to the discovery of one or more element declarations, which are processed in the same manner as global element declarations, except that the element nodes created are attached to the Model Group node in the graph, rather than being a root node. The processing of schema components continues in the same manner for all descendant element declarations. Then the program repeats the procedure on the next global element declaration, resulting in another graph. The analysis of the XML schema is complete when all global element declarations are processed.
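The graph construction described above may be sketched as follows for a deliberately small subset of XML schema: global element declarations with inline complex types whose content is a single sequence, choice, or all group. The Node class, the function names add_element and build_graphs, and the sample schema are hypothetical; attribute uses, named types, wildcards, and nested model groups are omitted for brevity.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"


class Node:
    """A graph node that records its kind, optional name, parent, and children."""

    def __init__(self, kind, name=None, parent=None):
        self.kind, self.name, self.parent = kind, name, parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def add_element(decl, parent=None):
    """Create an Element node for decl and, recursively, its component nodes."""
    node = Node("Element", decl.get("name"), parent)
    ctype = decl.find(XS + "complexType")
    if ctype is not None:
        ct_node = Node("Complex Type", parent=node)
        for group in ctype:
            tag = group.tag.replace(XS, "")
            if tag in ("sequence", "choice", "all"):
                particle = Node("Particle", parent=ct_node)
                mg_node = Node("Model Group", tag, parent=particle)
                # Child element declarations attach to the Model Group node.
                for child in group.findall(XS + "element"):
                    add_element(child, mg_node)
    return node


def build_graphs(xsd_text):
    """Return one graph per global element declaration, keyed by element name."""
    schema = ET.fromstring(xsd_text)
    roots = (add_element(d) for d in schema.findall(XS + "element"))
    return {root.name: root for root in roots}


XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="order">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="item" type="xs:string"/>
        <xs:element name="note" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

graphs = build_graphs(XSD)
```

Keying the resulting graphs by global element name allows the correct graph to be selected later from the name of the root element of an instance document.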
An exemplary method of processing an XML instance document is illustrated in
The program invokes a conventional XML parser, passing the instance document and the XML schema as input. The XML parser sends a “startElement” event for the root element, as depicted in block 66, and the program identifies the correct schema graph to use for the rest of the processing of the instance document based on the startElement, as depicted in block 70. The root node of this graph becomes the current node. For each subsequent start element event received from the parser and corresponding to element Ei, a path Pi from the current node to the node corresponding to Ei in the schema graph is calculated, as depicted in block 72. The program then traverses the path Pi, as depicted in block 74, wherein the nature of path Pi is such that traversal down the graph (from parent to child) causes a start event corresponding to the child node, and traversal up the graph (from child to parent) causes an end event for the child node. Only one element node (corresponding to element Ei) is in the path Pi. The program receives end event information corresponding to either the global element or the child element, as depicted in block 76. Upon receiving the end event information, the program traverses the path in reverse direction, generating an end event corresponding to a child node each time it moves from the child node to a parent node. Furthermore, this traversal should not cause violation of any rules described below for the Particle node, Model Group node, or a Wildcard node.
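The relationship between path traversal and event generation may be sketched as follows. The sketch assumes that each node stores a reference to its parent and that the path between two nodes runs through their lowest common ancestor; the names Node, ancestors, and traverse are hypothetical, and the additional rules for Particle, Model Group, and Wildcard nodes are intentionally ignored here.

```python
class Node:
    """A schema graph node with a label and a reference to its parent."""

    def __init__(self, label, parent=None):
        self.label, self.parent = label, parent


def ancestors(node):
    """Return the chain from node up to the root, node first."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain


def traverse(current, target):
    """Return the events generated while moving from current to target."""
    up, down = ancestors(current), ancestors(target)
    common = next(n for n in up if n in down)  # lowest common ancestor
    events = []
    # Moving from a child to its parent ends the child.
    for node in up[:up.index(common)]:
        events.append(("end", node.label))
    # Moving from a parent to a child starts the child.
    for node in reversed(down[:down.index(common)]):
        events.append(("start", node.label))
    return events


order = Node("order")
ctype = Node("Complex Type", order)
particle = Node("Particle", ctype)
group = Node("Model Group (sequence)", particle)
item = Node("item", group)

down_events = traverse(order, item)  # start of element "item" received
up_events = traverse(item, order)    # path traversed in the reverse direction
```

In this example the downward traversal yields start events for the Complex Type, Particle, Model Group, and item nodes in that order, and the reverse traversal yields the corresponding end events in the opposite order.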
A particle node is started in response to the start of an element that is its descendant in the XML component hierarchy (see
The following events mark the end of a particle node: 1) the end of an element that is an ancestor of the particle node in the XML component hierarchy (the element causing an end event for the particle must be the closest element ancestor of the particle node); 2) the start of an element that is not a descendant of the particle in question; and 3) the start of a child node after the occurrences property of the particle node is equal to the maximum allowed.
The start of a model group depends on the type of model group. If the model group is of type sequence, the start event occurs in response to the start of a child node in the XML schema component hierarchy. Note that the child is either an element node or a particle node. In a model group of type sequence, however, the start of a child node does not always result in the start of the model group. With reference to the data structure of
The end of a model group can also be triggered by the start of an element that is not a descendant of the model group in the XML component hierarchy, which essentially triggers the end of the model group's parent particle node.
If the model group is of type choice, the start of each child node causes the start of the model group. The end of each instance of a child node causes the end of the current instance of the model group.
If the model group is of type all, the start of any child node causes the start of this model group. However, the model group is not started for each occurrence of the child node; the start of the model group happens only for the first occurrence of any of its child nodes.
The end of a model group of type all is triggered by the start of an element that is not a descendant of the model group in the XML component hierarchy, which essentially triggers the end of the model group's parent particle node.
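The difference between the rules for choice and all model groups may be illustrated with a short sketch. The function name group_events is hypothetical, and the sketch makes a simplifying assumption: exhaustion of the list of child occurrences stands in for the start of an element that is not a descendant of the model group. Sequence model groups are not covered.

```python
def group_events(compositor, children):
    """Return the events for a model group given its child occurrences in order."""
    events = []
    if compositor == "choice":
        # Every child occurrence starts and ends its own model group instance.
        for child in children:
            events += [("start", "choice"), ("start", child),
                       ("end", child), ("end", "choice")]
    elif compositor == "all":
        for index, child in enumerate(children):
            if index == 0:
                # Only the first occurrence of any child starts the group.
                events.append(("start", "all"))
            events += [("start", child), ("end", child)]
        if children:
            # Assumption: the end of the input stands in for the start of a
            # non-descendant element, which ends the group.
            events.append(("end", "all"))
    else:
        raise ValueError("unsupported compositor: " + compositor)
    return events


choice_events = group_events("choice", ["a", "b"])
all_events = group_events("all", ["a", "b"])
```

For the same two child occurrences, the choice group is started and ended twice, while the all group is started and ended once.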
Special rules may also be employed to determine the beginning and end of wildcard nodes. While wildcard nodes were not processed as part of the example set forth above, it is possible to have such a node for a schema that contains wildcard declarations. The start of a wildcard node is triggered by the start of an element whose qualified name matches the specified wildcard constraints. The end of the element in the XML document that caused the start of the wildcard node marks the end of the wildcard node.
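One way to test whether an element matches a wildcard is sketched below, based on the namespace constraint values of the XML schema recommendation (“##any”, “##other”, or a whitespace-separated list that may include “##targetNamespace” and “##local”). The function name wildcard_matches and the use of None for an element without a namespace are assumptions of the sketch, and the processContents property is not considered.

```python
def wildcard_matches(constraint, target_ns, element_ns):
    """Return True if an element in namespace element_ns satisfies the wildcard.

    element_ns is None for an element that has no namespace (assumption).
    """
    if constraint == "##any":
        return True
    if constraint == "##other":
        # Any namespace other than the target namespace; unqualified
        # elements are excluded as well.
        return element_ns is not None and element_ns != target_ns
    allowed = set()
    for token in constraint.split():
        if token == "##targetNamespace":
            allowed.add(target_ns)
        elif token == "##local":
            allowed.add(None)
        else:
            allowed.add(token)
    return element_ns in allowed


TARGET = "urn:example:orders"  # hypothetical target namespace
print(wildcard_matches("##other", TARGET, "urn:example:extensions"))
print(wildcard_matches("##targetNamespace ##local", TARGET, None))
```

A wildcard node would be started when this test succeeds for an incoming start element event and ended on the matching end element event.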
An advantage of the present invention over the prior art is that events corresponding to XML instance document elements and schema definition components are generated in a single, serial process without the need to perform two separate processes. The program is operable to generate both start events and end events corresponding to each element and each definition component.
Appendix A presents a table of events corresponding to only information encountered in an XML instance document, namely elements and attributes associated with the elements. Appendix B presents an exemplary table of events generated in a process that considers information from the schema definition of
As described above, the invention is operable to cooperate with conventional XML processing software to generate events corresponding to schema components in addition to the events generated by the conventional XML processing software that correspond to instance document elements.
As depicted in block 126, the program generates a start event corresponding to a child node when traversing from a parent node to the child node. When traversing from a global element declaration, for example, to an element type component, the program generates a start event for the element type component wherein the type (i.e., complex or simple) is handled as an event parameter. The program generates an end event corresponding to the child node when traversing from the child node to the parent node, as depicted in block 128.