1. Field of the Invention
The present invention relates generally to an improved data processing system and in particular to a method and apparatus for validating code. More particularly, the present invention relates to a computer implemented method, apparatus, and a computer usable program product for validating an XML (extensible markup language) document against an XML schema.
2. Description of the Related Art
Many web pages on the Internet today are written in structured languages. A structured language is a programming language in which a program may be broken down into blocks or procedures, which can be written without detailed knowledge of the inner workings of other blocks, thus allowing a top-down design approach. Examples of structured languages include extensible markup language (XML), hypertext markup language (HTML), extensible hypertext markup language (XHTML), and many others. Additionally, structured languages include languages that are based on these other languages, such as RSS, MathML, GraphML, Scalable Vector Graphics (SVG), MusicXML, and others. Thus, structured languages are a very common source of computer programming.
Documents drafted in markup language or a structured language are often validated in order to ensure that the document is free of errors and will perform according to its intended use. When validating a structured language document, often the document is compared to a particular schema. For example, an XML document that complies with a particular schema, in addition to being well formed, is said to be valid. In another example, an XML schema is a description of an XML document, typically expressed in terms of constraints on the structure and contents of documents of that type, above and beyond the basic constraints imposed by XML itself. A number of standard and proprietary XML schema languages exist for the purpose of formally expressing such schemas. Some of these languages are XML based themselves. Examples of schemas for XML include document type definition (DTD), XML schema definition (XSD), W3C XML schema (WXS), RELAX NG, document schema definition languages (DSDL), and others.
The process of validating structured language documents can take a considerable amount of time, particularly when many documents are to be validated or when a particular document is very long. Thus, efforts have been made to improve the process of validating structured language documents. In the case of XML, documents are parsed and compared against a particular schema. Most traditional XML parsers, such as the Apache Xerces-J and Xerces-C parsers, scan and validate XML documents in two distinct phases. In Xerces-C, the scanner examines each tag name and item of text content for well-formedness, then presents each tag name and item of text content to validation componentry if validation is enabled for the document in question. The scanner then presents the data to an application program interface (API) generator, if the validation component returns an indication that the data is valid. In Xerces-J, a pipeline architecture is used in which a validation component may optionally be plugged in between the scanning component and the API generator. However, in neither of these architectures is any knowledge of the grammar against which the document is being validated used to assist scanning of the tokens comprising the document. Additionally, similarities between documents processed by a given parser are not used to speed up parsing.
The illustrative embodiments described herein provide for a method for validating a target document written in a structured language against a schema for the structured language. A record of document fragments that have been previously validated against the schema is maintained. The target document is compared to the document fragments to identify portions of the target document that are schematically identical to corresponding document fragments. Validation is omitted for at least one of the portions of the target document that are schematically identical to the corresponding document fragments when validating the target document.
In another illustrative example, the method further includes adding to the record of document fragments, after successful validation of the target document, at least one portion of the target document that was not schematically identical to any document fragments in the record of document fragments.
Another illustrative example provides for a method for validating a target document written in a structured language against a schema for the structured language. A first part of the target document is compared to a document fragment, wherein the document fragment was previously validated against the schema. Responsive to the first part of the target document matching the document fragment, validation of the first part of the target document is omitted.
In another illustrative example, the method further includes, responsive to the first part of the target document failing to match the document fragment, validating the first part of the target document.
In another illustrative example, the target document comprises a plurality of additional document fragments, wherein each of the plurality of additional document fragments was previously validated against the schema. In this case, the method further includes, responsive to the first part of the target document matching any of the plurality of additional document fragments, omitting validation of the first part of the target document. Responsive to the first part of the target document failing to match both the document fragment and all of the plurality of additional document fragments, the first part of the target document is validated.
In another illustrative example, the first part of the target document comprises less than all of the target document.
In another illustrative example, the document fragment is a second part of the target document.
In another illustrative example, the method further includes generating the document fragment by successfully validating the second part of the target document against the schema and then storing the second part of the target document as the document fragment.
In another illustrative example, the method further includes parsing the target document into the first part of the target document. In this case, the first part of the target document is a scanner event. The scanner event is transmitted to an event queue.
In another illustrative example, the scanner event comprises at least one of a start tag, a text content, a white space, and an end tag.
In another illustrative example, the method further includes transmitting the scanner event to a virtual machine and performing a comparison in the virtual machine.
In another illustrative example, the method further includes requesting an automaton processor to create a new state node and transmitting at least one object to the automaton processor.
In another illustrative example, the at least one object is selected from the group consisting of a reference to an associated instruction in a byte code, a byte array, a scanner context, and a virtual machine context.
In another illustrative example, the scanner context comprises at least one of a namespace, an element stack, and a symbol table.
In another illustrative example, the virtual machine context enables the virtual machine to validate a corresponding portion of a subsequent part of the target document.
In another illustrative example, the target document comprises an extensible markup language document and wherein the schema comprises an extensible markup language schema.
The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, further objectives and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:
With reference now to the figures and in particular with reference to
Computer 100 may be any suitable computer, such as an IBM® eServer™ computer or IntelliStation® computer, which are products of International Business Machines Corporation, located in Armonk, N.Y. Although the depicted representation shows a personal computer, other embodiments may be implemented in other types of data processing systems. For example, other embodiments may be implemented in a network computer. Computer 100 also preferably includes a graphical user interface (GUI) that may be implemented by means of systems software residing in computer readable media in operation within computer 100.
Next,
In the depicted example, data processing system 200 employs a hub architecture including a north bridge and memory controller hub (NB/MCH) 202 and a south bridge and input/output (I/O) controller hub (SB/ICH) 204. Processing unit 206, main memory 208, and graphics processor 210 are coupled to north bridge and memory controller hub 202. Processing unit 206 may contain one or more processors and even may be implemented using one or more heterogeneous processor systems. Graphics processor 210 may be coupled to the NB/MCH through an accelerated graphics port (AGP), for example.
In the depicted example, local area network (LAN) adapter 212 is coupled to south bridge and I/O controller hub 204, audio adapter 216, keyboard and mouse adapter 220, modem 222, read only memory (ROM) 224, universal serial bus (USB) and other ports 232. PCI/PCIe devices 234 are coupled to south bridge and I/O controller hub 204 through bus 238. Hard disk drive (HDD) 226 and CD-ROM 230 are coupled to south bridge and I/O controller hub 204 through bus 240.
PCI/PCIe devices may include, for example, Ethernet adapters, add-in cards, and PC cards for notebook computers. PCI uses a card bus controller, while PCIe does not. ROM 224 may be, for example, a flash basic input/output system (BIOS). Hard disk drive 226 and CD-ROM 230 may use, for example, an integrated drive electronics (IDE) or serial advanced technology attachment (SATA) interface. A super I/O (SIO) device 236 may be coupled to south bridge and I/O controller hub 204.
An operating system runs on processing unit 206. This operating system coordinates and controls various components within data processing system 200 in
Instructions for the operating system, the object-oriented programming system, and applications or programs are located on storage devices, such as hard disk drive 226. These instructions may be loaded into main memory 208 for execution by processing unit 206. The processes of the illustrative embodiments may be performed by processing unit 206 using computer implemented instructions, which may be located in a memory. Examples of a memory include main memory 208, read only memory 224, and memory in one or more peripheral devices.
The hardware shown in
The systems and components shown in
Other components shown in
The depicted examples in
The illustrative embodiments described herein provide for a method, apparatus, and computer usable program product for validating an XML (extensible markup language) document against an XML schema. However, the methods and devices described herein can be applied to other schema languages and other structured language documents. Examples of other schema languages to which the methods and devices described herein can be applied include DTD and RELAX NG, though many other structured language schemas can be used with the methods and devices described herein.
For example, an illustrative embodiment provides a method for validating a target document written in a structured language against a schema for the structured language. According to this illustrative method, a record of document fragments that have been previously validated against the schema is maintained. The document fragment is a portion of a document written in a structured language. The method also includes comparing the target document to the document fragments to identify portions of the target document that are schematically identical to corresponding document fragments. The term schematically identical means sufficiently similar in structure to a known fragment to be confident that if the known fragment is valid according to the schema, then the portion is also valid according to the schema, even if certain informational content is different. The exemplary method also includes omitting validation for at least one of the portions of the target document that are schematically identical to the corresponding document fragments when validating the target document.
After successful validation of the target document, at least one portion of the target document that was not schematically identical to the corresponding document fragments is added to the record of document fragments that is maintained. Thus, in this illustrative example, only those portions of a structured language document that have not previously been validated are validated. Those portions of the structured language document that have already been validated are not validated again. In this way, a more efficient scheme for validating target documents is presented.
Another illustrative method for validating a target document written in a structured language against a schema for the structured language is to first compare a first part of the target document to a document fragment. The document fragment was previously validated against the schema. Then, responsive to the first part of the target document matching the document fragment, validation of the first part of the target document is omitted. However, if the first part of the target document fails to match the document fragment, then the first part of the target document is instead validated.
Generally, the target document can include a great number of additional document fragments. Each of the additional document fragments was previously validated against the schema. In this case, an exemplary method includes, responsive to the first part of the target document matching any of the plurality of additional document fragments, omitting validation of the first part of the target document. However, responsive to the first part of the target document failing to match the document fragment and also failing to match any of the plurality of additional document fragments, the first part of the target document is validated.
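One way to picture "schematically identical" is as a comparison of structural signatures that ignore informational content. The following is a minimal sketch under that assumption and is not presented as the mechanism of the illustrative embodiments. The names `structural_signature`, `FragmentRecord`, and `toy_validate` are hypothetical, and the toy validator merely stands in for a real schema validator.

```python
import xml.etree.ElementTree as ET


def structural_signature(element):
    """Return a nested tuple of tag names and sorted attribute names.

    Text content and attribute values are deliberately ignored, which is
    one assumed reading of "schematically identical".
    """
    return (
        element.tag,
        tuple(sorted(element.attrib)),
        tuple(structural_signature(child) for child in element),
    )


class FragmentRecord:
    """Record of signatures of fragments already validated against a schema."""

    def __init__(self, validate):
        self._validate = validate      # hypothetical validator: Element -> bool
        self._seen = set()
        self.validations_run = 0

    def check(self, fragment_xml):
        element = ET.fromstring(fragment_xml)
        signature = structural_signature(element)
        if signature in self._seen:
            return True                # validation omitted
        self.validations_run += 1
        if self._validate(element):
            self._seen.add(signature)  # remembered only after success
            return True
        return False


def toy_validate(element):
    # Stand-in for a real schema validator: a book must contain a title.
    return element.tag == "book" and element.find("title") is not None


record = FragmentRecord(toy_validate)
first = record.check("<book><title>XML Basics</title></book>")
second = record.check("<book><title>Automata</title></book>")
third = record.check("<book><comment>no title</comment></book>")
```

In this sketch the second fragment is accepted without running the validator because only its text differs from the first. A practical implementation would need to ensure that the ignored content cannot affect validity, for example where the schema constrains the type of the text.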
These exemplary methods can be modified or expanded. For example, in an illustrative example, the first part of the target document is less than all of the target document. In another illustrative embodiment, the document fragment is a second part of the target document. In this illustrative embodiment, as a particular document is parsed and validated, those parts of the document that are similar to the previously validated parts are not further validated.
Thus, the illustrative embodiments described herein can be used to efficiently parse single documents as well as new documents, and compare such documents against older schemas. As new document fragments are validated by the illustrative methods described herein, the newly validated document fragments are stored so that they may be used to compare against additional document fragments parsed from the same or other target documents.
First, a given XML schema 300 is compiled into individual byte code 302. An XML schema is a description of an XML document typically expressed in the terms of constraints and structures of contents of documents of that type. The constraints and structures of the documents can be above and beyond the basic constraints and structures imposed by XML itself. Byte code 302 contains a collection of instructions. Validation engine 304 interprets these instructions one by one by parsing XML document 306. Because an instruction validates a subject part of XML document 306, the validation can succeed only when all invoked instructions have succeeded.
The output of validation engine 304 is validation result 308. Validation result 308 usually takes the form of an indication that the target part of XML document 306 is valid, or that the target part of XML document 306 is invalid. Assuming that the target part of XML document 306 is valid, then that part of the document is stored in automaton repository 310.
As additional parts of XML document 306 are compared to and validated against instructions in byte code 302 using validation engine 304, these additional validated parts of XML document 306 are stored in automaton repository 310. Thus, automaton repository 310 contains or stores one or more portions of XML document 306 which have previously been validated. These validated portions of XML document 306 can then be used when validating other portions of XML document 306 and also when validating other XML documents.
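The compile-then-interpret arrangement described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the byte code format of the embodiments: the opcodes `START`, `TEXT`, and `END`, the functions `compile_sequence` and `run`, and the toy content model (a `book` element with three text-only children in a fixed order) are all hypothetical.

```python
from typing import List, Tuple

Instruction = Tuple[str, str]   # (opcode, operand)
Event = Tuple[str, str]         # (kind, value)


def compile_sequence(parent: str, children: List[str]) -> List[Instruction]:
    """Compile "parent contains these text-only children, in order"."""
    code = [("START", parent)]
    for child in children:
        code += [("START", child), ("TEXT", ""), ("END", child)]
    code.append(("END", parent))
    return code


def run(code: List[Instruction], events: List[Event]) -> bool:
    """Interpret instructions one by one; every instruction must succeed."""
    if len(code) != len(events):
        return False
    for (opcode, operand), (kind, value) in zip(code, events):
        if opcode == "TEXT":
            if kind != "text":          # any text content is accepted
                return False
        elif (opcode, operand) != (kind.upper(), value):
            return False
    return True


byte_code = compile_sequence("book", ["title", "category", "comment"])

good = [("start", "book"),
        ("start", "title"), ("text", "XML"), ("end", "title"),
        ("start", "category"), ("text", "CS"), ("end", "category"),
        ("start", "comment"), ("text", "ok"), ("end", "comment"),
        ("end", "book")]

# Same events with category placed before title, which the sequence forbids.
bad = good[:1] + good[4:7] + good[1:4] + good[7:]

good_result = run(byte_code, good)
bad_result = run(byte_code, bad)
```

The point of the sketch is only that validation succeeds when every invoked instruction succeeds and fails as soon as one does not.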
Exemplary validation engine 400 shown in
In an illustrative example, scanner 402 first invokes automaton processor 404. Each automaton node corresponds to a begin tag, an end tag, an empty tag where a text node of the XML documents have been processed, or some other tag or component of the XML document. The XML document can be XML document 306 in
Therefore, when processing new XML documents an automaton can be traversed by automaton processor 404 by performing pattern matching. During pattern matching, execution of some of the instructions can be skipped. Thus, validation engine 400 shown in
In an illustrative example, when a new document is processed, an automaton is constructed. First, scanner 402 parses an XML document and checks the well-formedness of resulting XML fragments in order to produce scanner event 412. Scanner event 412 is a data structure that represents XML document fragment 410, which is a portion of the XML document. Scanner events 412 can include the start tag, text content, end tag, a white space, and other portions of an XML document. Subsequently, scanner event 412 is stored in event queue 406 as shown in
Thus, event queue 406 includes one or more scanner events 412 generated by scanner 402. Virtual machine 408 receives scanner event 412 from event queue 406. Event queue 406 can transmit scanner events 412 to virtual machine 408, or virtual machine 408 can fetch scanner event 412 from event queue 406. In either case, the term transmitted can be used to describe transferring scanner event 412 to virtual machine 408.
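The hand-off from scanner to event queue to virtual machine can be sketched as a producer and a consumer sharing a queue. The code below is an illustrative assumption rather than the scanner of the embodiments: `ScannerEvent`, `scan`, and the token pattern are hypothetical, and the toy pattern ignores attributes, comments, CDATA sections, and entities.

```python
import re
from collections import deque
from dataclasses import dataclass

# Matches a start tag, an end tag, or a run of character data.
_TOKEN = re.compile(r"<(/?)([A-Za-z_][\w.-]*)>|([^<]+)")


@dataclass(frozen=True)
class ScannerEvent:
    kind: str   # "start_tag", "end_tag", "text", or "whitespace"
    value: str


def scan(document, event_queue):
    """Check tag nesting and append one ScannerEvent per fragment."""
    open_elements = []
    position = 0
    while position < len(document):
        match = _TOKEN.match(document, position)
        if match is None:
            raise ValueError(f"not well formed at offset {position}")
        closing, name, text = match.groups()
        if text is not None:
            kind = "whitespace" if text.isspace() else "text"
            event_queue.append(ScannerEvent(kind, text))
        elif closing:
            if not open_elements or open_elements.pop() != name:
                raise ValueError(f"unexpected end tag </{name}>")
            event_queue.append(ScannerEvent("end_tag", name))
        else:
            open_elements.append(name)
            event_queue.append(ScannerEvent("start_tag", name))
        position = match.end()
    if open_elements:
        raise ValueError(f"unclosed element <{open_elements[-1]}>")


queue = deque()
scan("<aaa><bbb>ccc</bbb> </aaa>", queue)

# The consumer (the virtual machine in the text) fetches events in order.
kinds = []
while queue:
    kinds.append(queue.popleft().kind)
```

Here the consumer fetches from the queue; pushing events to the consumer would be an equally valid reading of "transmitted".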
Virtual machine 408 performs validation by executing instruction 407 of an XML schema over scanner event 412. This process repeats until all scanner events are consumed by virtual machine 408. This process may involve reconfiguring scanner 402 so that scanner 402 can optimally process subsequent content.
After validating XML fragment 410 using instruction 407, virtual machine 408 requests automaton processor 404 to create a new state node. Additionally, virtual machine 408 passes four objects to automaton processor 404. These objects include reference 414, which is a reference to the associated instruction in the byte code, byte array 416, scanner context 418, and virtual machine context 420. Reference 414 is stored for later use during partial validation. Byte array 416 represents the XML fragment at the byte level; automaton processor 404 uses it to compare XML fragment 410 with contents it has previously parsed. Scanner context 418 is used later in the process of virtual machine 408 so that scanner 402 can start parsing from an intermediate point. Scanner context 418 includes a number of elements such as, but not limited to, name space 422, element stack 424, and symbol table 426. Additionally, virtual machine context 420 is an object that enables virtual machine 408 to validate a corresponding portion of the subsequent XML document fragment.
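The four objects passed when a state node is created can be pictured as fields of a small record. The sketch below is an assumption about one possible layout, not the data structure of the embodiments. The class and field names, the `transitions` mapping, and the contents chosen for the scanner and virtual machine contexts are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ScannerContext:
    namespaces: dict      # assumed: prefix -> namespace URI in scope
    element_stack: list   # assumed: names of currently open elements
    symbol_table: dict    # assumed: names seen so far


@dataclass
class StateNode:
    instruction_ref: int             # index of the associated instruction
    byte_array: bytes                # the fragment at the byte level
    scanner_context: ScannerContext  # lets the scanner resume from here
    vm_context: dict                 # lets the virtual machine resume
    transitions: dict = field(default_factory=dict)  # bytes -> StateNode


class AutomatonProcessor:
    def __init__(self):
        # Root node that precedes any input.
        self.root = StateNode(-1, b"", ScannerContext({}, [], {}), {})

    def create_state_node(self, parent, instruction_ref, byte_array,
                          scanner_context, vm_context):
        """Create a node for a validated fragment and link it to its parent."""
        node = StateNode(instruction_ref, byte_array,
                         scanner_context, vm_context)
        parent.transitions[byte_array] = node
        return node


processor = AutomatonProcessor()
context = ScannerContext(namespaces={}, element_stack=["aaa"],
                         symbol_table={"aaa": 0})
node = processor.create_state_node(
    processor.root, instruction_ref=0, byte_array=b"<aaa>",
    scanner_context=context, vm_context={"next_instruction": 1},
)
```

Keying transitions by the byte array makes the later comparison against previously parsed contents a single dictionary lookup, which is one possible design choice among several.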
Although operation of validation engine 400 shown in
In the illustrative examples shown in
These instruction codes are validated against XML document fragments including <aaa> 520, <bbb> 522, ccc 524, </bbb> 526, </aaa> 528, and possibly other document fragments as indicated by ellipses 530. In the illustrative examples shown in
The exemplary automaton 500 represented by nodes 532-540 can be used by a virtual machine, such as that shown in
As shown in
In particular, the process shown in
In the illustrative examples shown in
In the process shown in
The next byte array to be processed is <xxx> 616. As the automaton processor cannot find any state representing this byte array, partial parsing will be started by the scanner. In order to partially parse from the intermediate fragment in XML document 612, the scanner loads the scanner context from the previous state representing <aaa> 614. For example, the scanner loads scanner context 418 shown in
This process is repeated with respect to XML document fragment zzz 618, XML document fragment </xxx> 620, </aaa> 622, and any other XML document fragments 624 that are different from previously validated XML document fragments. In this way, the corresponding instructions, such as instruction X 626, instruction Z 628, and instruction X1 630, are parsed and processed.
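The traversal just described, in which known byte arrays are matched and unknown ones trigger partial parsing from a saved context, can be sketched as a walk over a tree of states. This is a simplified illustration under stated assumptions: `Node`, `process`, and `toy_validate` are hypothetical, the scanner context is reduced to an element stack, and the validator only tracks tag nesting.

```python
class Node:
    def __init__(self, element_stack):
        self.next = {}                      # byte fragment -> Node
        self.element_stack = element_stack  # saved scanner context (simplified)


def process(fragments, root, validate):
    """Walk the automaton; validate only fragments with no matching state.

    Returns the list of fragments that actually had to be validated.
    """
    validated = []
    node = root
    for fragment in fragments:
        known = node.next.get(fragment)
        if known is not None:
            node = known                    # pattern match: instruction skipped
            continue
        # No state for this byte array: resume from the saved context.
        stack = list(node.element_stack)
        if not validate(fragment, stack):
            raise ValueError(f"invalid fragment {fragment!r}")
        validated.append(fragment)
        new_node = Node(stack)
        node.next[fragment] = new_node
        node = new_node
    return validated


def toy_validate(fragment, stack):
    # Stand-in validator: maintains the element stack, accepts any text.
    if fragment.startswith(b"</"):
        return bool(stack) and stack.pop() == fragment[2:-1]
    if fragment.startswith(b"<"):
        stack.append(fragment[1:-1])
    return True


root = Node([])
first = process([b"<aaa>", b"<bbb>", b"ccc", b"</bbb>", b"</aaa>"],
                root, toy_validate)
second = process([b"<aaa>", b"<xxx>", b"zzz", b"</xxx>", b"</aaa>"],
                 root, toy_validate)
```

On the second document only `<aaa>` is matched against an existing state; the remaining fragments follow a new branch and are validated, starting from the element stack saved with the `<aaa>` state.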
XML schema 700 allows three elements, title 702, category 704, and comment 706 in sequential order under book element 708. Based on XML schema 700 shown in
Automaton 900 includes a number of state nodes, including state node 902, state node 904, state node 906, state node 908, state node 910, state node 912, state node 914, state node 916, state node 918, and state node 920. Each state node includes one or more input characters that are consumed by that state node. Thus, for example, input characters 922 corresponding to state node 902 are the characters <books>. Input characters 922 are consumed by the corresponding state node 902. Similarly input characters 924 are consumed by state node 904; input characters 926 are consumed by state node 906; input characters 928 are consumed by state node 908; input characters 930 are consumed by state node 910; input characters 932 are consumed by state node 912; input characters 934 are consumed by state node 914; input characters 936 are consumed by state node 916; input characters 938 are consumed by state node 918; and input characters 940 are consumed by state node 920. Additionally, each state node shown in
Additionally, each state node shown in
The arrows shown in
As can be seen, XML document fragment 1000 is similar to XML document fragment 800 shown in
Because elements books 1008, book 1010, and title 1002 shown in
Automaton 1100 is similar to automaton 900 shown in
According to the illustrated embodiments described herein, state nodes 1102, 1104, 1106, 1110, 1112, 1116, 1124, and 1126 are schematically identical to corresponding state nodes 902, 904, 906, 910, 912, 916, 918, and 920 in
The state nodes shown in
Similarly, new state nodes shown in
The process begins as the virtual machine of the validation engine compiles an XML schema definition into byte code containing a set of instructions (step 1200). The virtual machine interprets an instruction in the set of instructions (step 1202). The virtual machine compares the instruction to part of the XML document (step 1204). The virtual machine then determines whether part of the XML document has been validated already (step 1206).
If the part of the XML document has not been validated already (“no” response to step 1206), the virtual machine validates that part of the XML document (step 1208). The virtual machine then stores the validated part of the XML document (step 1210). If the part of the XML document has already been validated (“yes” response to step 1206), then steps 1208 and 1210 are omitted and the virtual machine proceeds directly to step 1212. The virtual machine then determines whether validation of the XML document is complete (step 1212). In particular, the virtual machine examines whether or not additional instructions, in the set of instructions, are to be compared to part of the XML document or if there are other parts of the XML document that need to be compared to a particular instruction. In either case, if the validation of the XML document is not complete, then the process returns to step 1202. However, if validation of the XML document is complete, or if that particular part of the XML document has already been determined to be valid (yes to step 1206) and validation of the XML document is complete (yes to step 1212), then the process terminates.
The process begins as the scanner parses an XML message (step 1300). The scanner then checks the format of the message (step 1302). The scanner determines whether the format is valid (step 1304). If the format is not valid, then a process error is generated (step 1306) and the process terminates thereafter.
However, if the format of the XML message is valid in step 1304, then the scanner forms a scanner event (step 1308). The scanner event is a part of the XML message described with reference to step 1300. The scanner then transmits the scanner event to an event queue (step 1310), with the process terminating thereafter. Although the process is described as terminating at this point in
The process begins as the virtual machine fetches a scanner event from the event queue (step 1400). The virtual machine then determines whether the scanner event has been previously validated (step 1402). If the scanner event has been previously validated, then the process terminates.
However, if the scanner event has not been validated previously (a “no” response at step 1402), then the virtual machine validates the scanner event (step 1404). The virtual machine then requests creation of a new state node of an automaton (step 1406). The virtual machine then transmits objects to an automaton processor (step 1408), with the process terminating thereafter.
The automaton processor described with respect to step 1408 can be an automaton processor in a validation engine, such as automaton processor 404 of validation engine 400 shown in
The process begins as the virtual machine stores a reference to an instruction (step 1500). The virtual machine then compares a scanner event with previously parsed contents of other instructions (step 1502). In this way, the virtual machine validates the scanner event. The virtual machine then notifies the scanner of the validation results (step 1504). Next, the virtual machine transmits scanner context to the scanner (step 1506). The virtual machine then creates or updates a state node in the corresponding automaton (step 1508). Finally, the automaton processor generates a new automaton for use in validating further XML document fragments (step 1510). The process terminates thereafter.
The process begins as the validation engine maintains a record of document fragments that have been previously validated against a schema (step 1600). The validation engine then compares a target document to the document fragments to identify portions of the target document that are schematically identical to corresponding document fragments (step 1602). The validation engine then determines whether a portion of the target document is schematically identical to a corresponding document fragment (step 1604). If the portion of the target document is schematically identical to a corresponding document fragment, then validation of the portion of the target document is omitted (step 1606), and the process skips to step 1612. However, if the portion of the target document is not schematically identical to a corresponding document fragment, then the validation engine validates that portion of the target document (step 1608). The validation engine then adds the valid document fragment to the record of document fragments (step 1610). The validation engine then determines whether additional portions of the target document are to be analyzed (step 1612). If additional portions of the target document are to be analyzed, then the process returns to step 1604. Otherwise, the process terminates.
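Steps 1600 through 1612 can be sketched as a single loop over the portions of the target document. The sketch is illustrative only: `validate_document`, `same_tag`, and `allowed_tag` are hypothetical, the comparison and validation functions are supplied by the caller, and the "invalid" outcome is an assumption, since the flow described above does not state how a failed validation is reported.

```python
def validate_document(portions, record, is_schematically_identical, validate):
    """Return one outcome per portion: "skipped", "validated", or "invalid".

    `record` is a list of previously validated fragments (step 1600) and
    is extended in place as new portions validate successfully.
    """
    outcomes = []
    for portion in portions:                       # steps 1604 and 1612
        if any(is_schematically_identical(portion, fragment)
               for fragment in record):            # step 1602
            outcomes.append("skipped")             # step 1606
        elif validate(portion):                    # step 1608
            record.append(portion)                 # step 1610
            outcomes.append("validated")
        else:
            outcomes.append("invalid")             # assumed error outcome
    return outcomes


# Toy stand-ins: portions are (tag, text) pairs; same tag counts as identical.
def same_tag(portion, fragment):
    return portion[0] == fragment[0]


def allowed_tag(portion):
    return portion[0] in {"title", "category", "comment"}


record = [("title", "XML Basics")]
outcomes = validate_document(
    [("title", "Automata"), ("category", "CS"),
     ("category", "Math"), ("price", "10")],
    record, same_tag, allowed_tag,
)
```

Because the record grows during the loop, the second `category` portion is skipped on the strength of the first, which mirrors the case in which the document fragment is itself an earlier part of the same target document.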
The process begins as a validation engine parses a target document into a first part of the target document and a second part of the target document (step 1700). The validation engine then compares the first part of the target document to a document fragment that was previously validated against a schema (step 1702). The validation engine then determines whether the first part of the target document matches the document fragment (step 1704). If the first part of the target document matches the document fragment, then the validation engine will omit the validation of the first part of the target document (step 1706). The process then continues at step 1712. However, if the first part of the target document does not match the document fragment, then the validation engine validates the first part of the target document (step 1708). The validation engine then adds the first part of the target document to a set of document fragments (step 1710).
The validation engine then determines whether a second part of the target document matches one of the document fragments in the set of document fragments (step 1712). If the second part of the target document does match one of the document fragments in the set of document fragments, then the validation engine omits validation of the second part of the target document (step 1714). The process will then continue with step 1720. However, if the second part of the target document does not match one of the document fragments in the set of document fragments, then the validation engine will validate the second part of the target document (step 1716). The validation engine will then add the second part of the target document to the set of document fragments (step 1718).
The validation engine then determines whether additional parts of the target document are to be analyzed (step 1720). If additional parts of the target document are to be analyzed, then the validation engine repeats validation or skipping of validation for each additional part of the target document (step 1722). During this process, the validation engine will validate those additional parts of the target document that have not already been validated. The validation engine will skip validation of those additional parts of the target document that match one or more document fragments in the set of document fragments. For each additional part that the validation engine does validate, the validation engine will add that part of the target document to the set of document fragments (step 1724). The process then returns to step 1720. If no additional parts of the target document are to be analyzed at step 1720, then the process terminates.
Thus, the illustrative embodiments described herein provide for a method, apparatus and computer usable program product for validating XML documents against an XML schema. However, the methods and devices described herein can be applied to other schema languages and other structured language documents. Thus, the illustrative embodiments described herein provide a mechanism for increasing the efficiency and speed of validating target XML documents against an XML schema. More generally, the illustrative embodiments described herein provide a mechanism for quickly and efficiently validating documents written in a structured language against a structured language schema. The illustrative embodiments described herein create a faster mechanism for validating structured language documents because those portions of a particular structured language document that have already been validated do not have to be further validated.
The invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements. In a preferred embodiment, the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
Furthermore, the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer readable medium can be any tangible apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.
A data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) can be coupled to the system either directly or through intervening I/O controllers.
Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems, and Ethernet cards are just a few of the currently available types of network adapters.
The description of the present invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiment was chosen and described in order to best explain the principles of the invention, the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.