Computer systems may exchange data, e.g., via a network or other connection or path, in many forms. Abstract Syntax Notation number One (“ASN.1”) and extensible markup language (“XML”) are two of many powerful tools currently used widely to represent and exchange data. For example, XML is a meta-language that allows one to define how data will be represented in a manner understandable to others across platforms, applications, and communications protocols. The current version of XML, XML 1.0 (2nd Ed.), is specified by the World Wide Web Consortium (W3C) in the W3C Recommendation entitled “Extensible Markup Language (XML) 1.0, Second Ed.”, dated Aug. 14, 2000, available at http://www.w3.org/TR/REC-xml, which specification is incorporated herein by reference for all purposes.
XML may be used to exchange data for many useful purposes. One growing area of use is the web services sector. The term “web services” refers generally to the idea of using a first computer, e.g., an application server, to perform computations or other processing tasks for one or more other computers that have access to the first computer via a network, such as the World Wide Web. For example, a client computer may be configured to invoke an application or other process running on a server computer with which the client is configured to communicate via a network by sending to the server a “remote procedure call” identifying, e.g., the processing to be performed and providing the input data, if any, required to perform the operation. Depending on the nature of the application or process running on the server and/or the remote procedure call (RPC), the server may be configured to return to the client (or some other destination) some result of the processing or computation performed by the server. For example, a web-based airline reservation service may contract with a third party to process credit card transactions based on reservation, credit card, and price information passed to the third party's server by one of the airline reservation service's systems.
To facilitate the use of web services and similar technologies, the W3C has developed the Simple Object Access Protocol (SOAP), as described in the SOAP Version 1.2 specification, dated Jun. 24, 2003, a copy of which is available on the web at http://www.w3.org/TR/soap12, which is incorporated herein by reference for all purposes. SOAP defines a lightweight communications protocol for sending requests to remote systems, e.g., an RPC to a remote web services platform. SOAP requests and responses are encapsulated in a SOAP “envelope”. The envelope includes a header portion that includes information about the request and how it should be handled and processed and a body portion that includes the request itself and associated data. SOAP requests and responses may be sent using any suitable transport protocol or mechanism, e.g., HTTP. In many cases, the request and associated data included in the body portion of a SOAP request take the form of an XML document (or other infoset), due to the platform and application independent nature of XML as described above.
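The envelope layout described above can be illustrated with a minimal sketch using Python's standard `xml.etree.ElementTree` module. The namespace URI is the SOAP 1.2 envelope namespace; the `<priority>` header entry, the `<toy>` payload, and the helper name `split_envelope` are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://www.w3.org/2003/05/soap-envelope"  # SOAP 1.2 envelope namespace

# Hypothetical request: the <priority> header and <toy> payload are illustrative only.
ENVELOPE = f"""<env:Envelope xmlns:env="{SOAP_NS}">
  <env:Header><priority>low</priority></env:Header>
  <env:Body><toy><type>ball</type><color>red</color></toy></env:Body>
</env:Envelope>"""


def split_envelope(text):
    """Return (header_element, body_element) of a SOAP envelope."""
    root = ET.fromstring(text)
    header = root.find(f"{{{SOAP_NS}}}Header")  # may be None: the header is optional
    body = root.find(f"{{{SOAP_NS}}}Body")
    return header, body


header, body = split_envelope(ENVELOPE)
payload = body[0]  # first child of the body: the XML document being carried
```

In this sketch the payload, rather than the envelope around it, is the XML document that would be handed to a validator.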
Whether for purposes of sending and receiving web services requests and responses, e.g., SOAP requests and responses, or for any other purpose requiring an exchange of data, such as in the form of an XML document, it is often important to validate data prior to sending or (if received) processing it, e.g., to detect errors in the data or how it is represented in order to avoid generating incorrect or otherwise undesired results, avoid application or system errors or failure, avoid security breaches or other compromises to data, etc.
One way to validate an XML document, for example, is to verify that the document conforms to the structure and content rules prescribed for the document. Under the XML specification, a document type definition (DTD) may be used to define the structure and content of XML documents of the type defined by a particular DTD. A DTD may be used, e.g., to define the data elements that may occur in an XML document governed by the DTD, the attributes associated with each element, the type (e.g., format or nature) of data values that may be associated with each element and/or attribute, and the relationship of elements to each other (e.g., which elements are sub-elements of which other elements, how many times may an element occur, must elements occur in a particular order, etc.). Other definitions such as XML schema, Schematron (specification available at http://www.schematron.com/spec.html), ASN.1 Module Definitions, or other definition information may be used to define the structure and content of documents.
The XML schema definition language provides additional tools that can be used to define a class of XML documents. The XML schema language is described and defined in the following W3C documents: XML Schema Requirements, dated Feb. 15, 1999, available at www.w3.org/TR/NOTE-xml-schema-req; XML Schema Part 1: Structures, dated May 2, 2001, available at www.w3.org/TR/xml-schema-1; and XML Schema Part 2: Data Types, dated May 2, 2001, available at www.w3.org/TR/xmlschema-2. Like a DTD, an XML schema is used to define data types and prescribe the grammar for declaring elements and attributes. The XML schema language provides an inventory of XML markup constructs that may be used to create a schema that defines and describes a class of XML documents. Syntactic, structural, and value constraints may be specified.
XML parsers configured to use schema to validate XML documents have been provided. For example, a SAX (Simple API for XML available at www.saxproject.org) type XML parser may be configured to use XML schema to validate XML documents. However, such validation may consume significant processing resources and may be difficult to complete in the time required to validate hundreds or thousands of XML documents (e.g., SOAP or other transactions) per second, as may be required in a web services or other environment. Therefore, there is a need for a reliable and efficient way to accelerate validation of XML documents and similar data sets and files using a definition such as an XML schema.
Various embodiments of the invention are disclosed in the following detailed description and the accompanying drawings.
The invention can be implemented in numerous ways, including as a process, an apparatus, a system, a composition of matter, a computer readable medium such as a computer readable storage medium or a computer network wherein program instructions are sent over optical or electronic communication links. In this specification, these implementations, or any other form that the invention may take, may be referred to as techniques. A component such as a processor or a memory described as being configured to perform a task includes both a general component that is temporarily configured to perform the task at a given time and a specific component that is manufactured to perform the task. In general, the order of the steps of disclosed processes may be altered within the scope of the invention.
A detailed description of one or more embodiments of the invention is provided below along with accompanying figures that illustrate the principles of the invention. The invention is described in connection with such embodiments, but the invention is not limited to any embodiment. The scope of the invention is limited only by the claims and the invention encompasses numerous alternatives, modifications and equivalents. Numerous specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example and the invention may be practiced according to the claims without some or all of these specific details. For the purpose of clarity, technical material that is known in the technical fields related to the invention has not been described in detail so that the invention is not unnecessarily obscured.
XML gateway 108 is shown in
In some embodiments, the XML gateway 108 may be configured to perform accelerated XML schema-based validation on an XML document by learning the structure and validation rules of an earlier-received and processed XML document, recognizing that a subsequently received XML document has the same structure as the earlier-received document, and then using the structure and validation rules learned from processing the earlier document to validate the later-received document without performing a full validation using a validating XML parser. In some embodiments, this latter step is performed by using the structure information learned from processing the earlier-received document to find and apply applicable validation rules just to the data values of the later-received document; i.e., by not validating separately the structural portions common to the two documents.
The approach disclosed herein accelerates structured data validation. In some embodiments, accelerating XML schema-based or similar validation takes advantage of the fact that in many cases, such as in a web services environment, hundreds or thousands of very similarly structured XML documents, e.g., SOAP or other web services requests and/or replies, may need to be validated every second. Often, computers or other devices generate these requests, as opposed to humans, and as a result requests of a particular type (e.g., generated by a particular application, type of server, etc.) tend to be identical in structure and differ only in the values of certain data elements and/or attributes. An XML or similar document can be thought of as having a tree structure. The branches define the structure of the document, which does not vary between documents of a particular type, and “leaf nodes” represent the data values that can change from one instance of the XML document to the next. In the approach disclosed herein, once it is recognized that the structure (i.e., the branches) of two XML documents is the same, only the data values (i.e., the leaf nodes) of the later-received document are validated, using validation rules learned either by pre-processing a schema with which the documents are associated or from processing the earlier-received document. The structure portion is not validated separately for the later-received document because it has already been validated through the validation of the earlier-received document. XML validation is merely an illustrative example. Other forms of structured data may be validated using the approach disclosed herein. In some embodiments, ASN.1 data is validated using the approach disclosed herein.
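The distinction between branches and leaf nodes can be sketched as follows. This is an illustration only, not the disclosed implementation: it assumes Python's standard `xml.etree.ElementTree` module, and the two sample documents and the function name `split_structure_and_values` are hypothetical.

```python
import xml.etree.ElementTree as ET


def split_structure_and_values(xml_text):
    """Return (structure, values): tag paths in document order, and leaf text."""
    structure, values = [], []

    def walk(elem, path):
        here = f"{path}/{elem.tag}"
        structure.append(here)  # every element contributes a branch
        text = (elem.text or "").strip()
        if len(elem) == 0 and text:  # no child elements: a leaf carrying a data value
            values.append((here, text))
        for child in elem:
            walk(child, here)

    walk(ET.fromstring(xml_text), "")
    return structure, values


# Two hypothetical instances of the same document type.
s1, v1 = split_structure_and_values("<toy><type>ball</type><color>red</color></toy>")
s2, v2 = split_structure_and_values("<toy><type>kite</type><color>blue</color></toy>")
```

Here `s1` and `s2` are equal while `v1` and `v2` differ, which is the situation in which only the leaf values of the second document would need to be validated.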
While the example shown in
This simple document describes a toy that is a red ball. Such a document might be an instance of a class of documents defined by an XML schema that defines an element “toy” having a first sub-element “type” and a second sub-element “color” each of which can have a string of characters as its data value. The schema might define further constraints for either the sub-elements or their associated data values, e.g., constraints relating to the order in which the sub-elements must appear, the number of times each sub-element must or may appear, etc. A SAX-type parser is used in some embodiments to distinguish between structural portions and data values. A SAX parser recognizes strings in the form <tag> as element start tags and strings in the form </tag> as element end tags. Each such start or end tag generates an “event” that initiates appropriate parsing by the parser to identify and extract the elements and data values associated with the start and/or end tag. In some embodiments, the structure portions of the document are identified and added to the portions of the document used to calculate a value representative of the structure of the document, sometimes referred to herein as a “structure value”. In some embodiments, the structure value is a hash value calculated based on the string of characters associated with the structure of the document. Use of a parser to distinguish between structure portions and data values is discussed more fully below, e.g., in connection with
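One possible way to compute a structure value with a SAX-type parser is sketched below, assuming Python's standard `xml.sax` and `hashlib` modules. The choice of SHA-256, the class name `StructureHasher`, and the sample documents (an assumed rendering of the toy described above) are illustrative assumptions rather than details taken from the disclosure.

```python
import hashlib
import xml.sax


class StructureHasher(xml.sax.ContentHandler):
    """Feeds start and end tags, but not character data, into a hash."""

    def __init__(self):
        super().__init__()
        self._hash = hashlib.sha256()

    def startElement(self, name, attrs):
        self._hash.update(f"<{name}>".encode())

    def endElement(self, name):
        self._hash.update(f"</{name}>".encode())

    def characters(self, content):
        pass  # data values are deliberately left out of the structure value

    def digest(self):
        return self._hash.hexdigest()


def structure_value(xml_text):
    """Parse the document and return a hash of its structural portions only."""
    handler = StructureHasher()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.digest()


red_ball = "<toy><type>ball</type><color>red</color></toy>"
blue_kite = "<toy><type>kite</type><color>blue</color></toy>"
```

Under these assumptions `structure_value(red_ball)` equals `structure_value(blue_kite)`, while a document that omits `<color>` yields a different value.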
In some embodiments, the determination performed in step 204 is based on only a subset of the data associated with the document. For example, a SOAP envelope and header might not be included in the calculation of the structure value. Also, information such as the version of XML being used, e.g., might not be included in the calculation.
In step 206 of the process shown in
If it is determined in step 206 that the structure of the received document matches the structure of a previously received and validated document, the process advances to step 214 in which an accelerated validation is performed. Information learned from validating the previously received document is used to validate the current document without performing a full schema-based validation. In some embodiments, information about the location in documents having the structure of the received document of data values to be validated and the validation rule(s) applicable to each data value that was learned from processing the previously received document that had the same structure is used to quickly validate just the data values in the later-received document. In some embodiments, the structural portions of the later-received document, which were validated fully in the previously received document and which have been determined in step 206 to be the same as the corresponding portions of the previously received document, are not validated for the later-received document. In some embodiments, while the structure is determined in step 204 an array of element and attribute data values is built in which each data value is associated with the result of the structure value calculation up to the point in the document at which the data value is found. In such embodiments, step 214 comprises running quickly through the array of data values, finding the applicable validation rule(s) by finding for each data value the corresponding entry in the table or other structure associated with the previously processed document having the same structure, and applying to each value the rules applicable to it. Once the accelerated validation is completed, the process ends. In some embodiments, the structure of a type of XML document may be learned by preprocessing a schema associated with the document type.
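The accelerated path can be sketched as follows, assuming Python's standard `xml.sax` and `hashlib` modules. All names here (`Scanner`, `LEARNED`, `learn`, `accelerated_validate`) are hypothetical, and `learn` merely pairs rules with data values in document order as a stand-in for what a full schema-based validation would record.

```python
import hashlib
import xml.sax


class Scanner(xml.sax.ContentHandler):
    """Hashes tags and records each data value with the hash state at that point."""

    def __init__(self):
        super().__init__()
        self._hash = hashlib.sha256()
        self._buf = []
        self.values = []  # (running structure value, data value)

    def _flush(self):
        text = "".join(self._buf).strip()
        self._buf.clear()
        if text:
            self.values.append((self._hash.hexdigest(), text))

    def startElement(self, name, attrs):
        self._flush()
        self._hash.update(f"<{name}>".encode())

    def endElement(self, name):
        self._flush()
        self._hash.update(f"</{name}>".encode())

    def characters(self, content):
        self._buf.append(content)  # text may arrive in several chunks


def scan(xml_text):
    scanner = Scanner()
    xml.sax.parseString(xml_text.encode("utf-8"), scanner)
    return scanner._hash.hexdigest(), scanner.values


LEARNED = {}  # final structure value -> {running structure value: rule}


def learn(xml_text, rules):
    """Stand-in for full validation: pair each data value location with a rule."""
    final, values = scan(xml_text)
    LEARNED[final] = {key: rule for (key, _), rule in zip(values, rules)}


def accelerated_validate(xml_text):
    """Return True or False for a known structure, None if full validation is needed."""
    final, values = scan(xml_text)
    table = LEARNED.get(final)
    if table is None:
        return None
    for key, value in values:
        rule = table.get(key)
        if rule is None or not rule(value):
            return False
    return True


learn("<toy><type>ball</type><color>red</color></toy>", [str.isalpha, str.isalpha])
```

After the `learn` call, a document with the same tags but different values is checked by table lookups alone, while a document with a different structure returns `None` and would fall back to full validation.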
The schema for the above document might be identified, e.g., based on the root element <toy>, or other identifying information in the document. The schema might, e.g., define a structure in which each element <toy> must comprise one and only one sub-element <type> and may comprise either no or one sub-element <color>, each sub-element being a character string. Validation would comprise checking to see that the element <toy> in the above instance of the class of documents defined by the schema satisfies all the constraints defined in the schema. In this case, it would be determined that the element <toy> comprises the required sub-element <type> with an associated data value that is valid for that sub-element (i.e., the character string “ball”) and permissibly includes one occurrence of the optional sub-element <color> with an associated data value that is valid for that sub-element (i.e., the character string “red”). The schema might impose further or different constraints than those supposed above by way of example.
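The example constraints supposed above can be written out by hand as a small check. This sketch uses Python's standard `xml.etree.ElementTree` module rather than a schema-driven validator; the function name `validate_toy` and the wording of the error messages are hypothetical.

```python
import xml.etree.ElementTree as ET


def validate_toy(xml_text):
    """Return a list of constraint violations; an empty list means valid."""
    root = ET.fromstring(xml_text)
    if root.tag != "toy":
        return [f"unexpected root element <{root.tag}>"]
    errors = []
    if len(root.findall("type")) != 1:
        errors.append("exactly one <type> is required")
    if len(root.findall("color")) > 1:
        errors.append("at most one <color> is allowed")
    for child in root:
        if child.tag not in ("type", "color"):
            errors.append(f"unexpected element <{child.tag}>")
        elif len(child):  # a character string may not contain child elements
            errors.append(f"<{child.tag}> must contain only character data")
    return errors
```

A full validating parser would derive these checks from the schema for every document; the accelerated approach aims to avoid repeating the structural checks for documents whose structure has already been seen.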
If the portion being processed is a data value (404), the process advances to step 408, in which the location of the data value within the document and any associated validation rule(s) are learned. In some embodiments, the location of the data element is learned by storing the current value for the structure value computed in step 406 as of the most recently processed structural portion, because that value represents the structural portions of the document up to that point in the document and would be the same as the corresponding value calculated for a subsequently received document having the same structure. In some embodiments, the location of the data value and associated validation rules are stored in a data structure, such as a table. In some embodiments, a pointer to the validation rule is stored.
Once a portion of the document has been identified and processed as either a structure portion or a data value in step 406 or step 408, as applicable, it is determined in step 410 whether the portion just processed is the last portion of the document required to be processed. If the portion just processed is determined to be the last portion required to be processed, the process ends. Otherwise, the next portion to be processed is received or identified in step 412 and the process repeats as to that portion and any subsequent portions until the entire document has been processed.
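The learning loop described above can be sketched as follows. A naive regular-expression tokenizer stands in for the parser, and the `RULES` dictionary is a hypothetical stand-in for constraints obtained from a schema; neither is taken from the disclosure.

```python
import hashlib
import re

# Hypothetical rules standing in for constraints taken from a schema.
RULES = {"type": str.isalpha, "color": str.isalpha}


def learn_document(xml_text):
    """Return (structure value, table mapping data value location -> rule)."""
    running = hashlib.sha256()
    table = {}
    current = None  # name of the most recently opened element
    for portion in re.findall(r"<[^>]+>|[^<]+", xml_text):
        if portion.startswith("<"):
            # Structural portion: fold it into the structure value.
            running.update(portion.encode())
            if not portion.startswith("</"):
                current = portion[1:-1]
        elif portion.strip():
            # Data value: its location is the structure value computed so far,
            # and the table stores a reference to the applicable rule.
            table[running.hexdigest()] = RULES.get(current)
    return running.hexdigest(), table


value_a, table_a = learn_document("<toy><type>ball</type><color>red</color></toy>")
value_b, table_b = learn_document("<toy><type>kite</type><color>blue</color></toy>")
```

Because the keys depend only on the tags seen so far, a later document with the same structure produces the same keys, so each of its data values can be matched to a stored rule without consulting the schema.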
If it is determined in step 508 that one or more applicable validation rules are not satisfied by the data value, error processing is performed in step 514. The error processing may comprise sending an alert, blocking a request or response associated with the document, setting a flag in a request or response, or any other responsive action that may be desired or appropriate in a particular implementation. Once the error processing has been performed, the process proceeds to step 510 and continues as described above. In some alternative embodiments, if invalid data is detected in step 508, the processing of a document ends after error processing has been performed in step 514. In such embodiments, the arrow shown in
If the process of
By using the accelerated approach described above, time and computing resources are saved because the structure portions of the subsequently received document, which are the same as the corresponding portions of the previously validated document, are not validated again each time a document having the same structure is received. Also, the data values that require validation, as well as the validation rule(s) that apply to them, can be located quickly without requiring that the schema be consulted and processed, for example.
In some embodiments, the processes shown in
may also be defined to be represented as:
In some embodiments, attributes embedded in XML tags are identified as leaf nodes and validated individually as described above. In some such embodiments, the name of the attribute is added to the structure value calculation and the data value added to an array of data values to be validated. A SAX type parser, for example, could be configured to recognize such attributes included in tags and generate an “event” when an attribute is encountered. In other embodiments, at least certain types of data values included as attributes defined within tags are validated during full validation but are processed as part of the “structure” of the document for purposes of the processes of
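The first attribute-handling variant, in which the attribute name joins the structure value and the attribute value joins the array of data values, might be sketched as follows using Python's standard `xml.sax` module. Element character data is omitted for brevity, attribute names are sorted because attribute order is not significant in XML, and the class and function names are hypothetical.

```python
import hashlib
import xml.sax


class AttributeAwareScanner(xml.sax.ContentHandler):
    """Treats attribute names as structure and attribute values as data."""

    def __init__(self):
        super().__init__()
        self._hash = hashlib.sha256()
        self.values = []  # (running structure value, attribute value)

    def startElement(self, name, attrs):
        self._hash.update(f"<{name}>".encode())
        for attr_name in sorted(attrs.getNames()):
            # The attribute name is structure; its value is data to validate.
            self._hash.update(f"@{attr_name}".encode())
            self.values.append((self._hash.hexdigest(), attrs.getValue(attr_name)))

    def endElement(self, name):
        self._hash.update(f"</{name}>".encode())


def scan_attributes(xml_text):
    """Return (structure value, recorded attribute values) for a document."""
    scanner = AttributeAwareScanner()
    xml.sax.parseString(xml_text.encode("utf-8"), scanner)
    return scanner._hash.hexdigest(), scanner.values


s_red, v_red = scan_attributes('<toy color="red"><type>ball</type></toy>')
s_blue, v_blue = scan_attributes('<toy color="blue"><type>kite</type></toy>')
```

In this sketch the two documents share a structure value even though their `color` attributes differ, and renaming the attribute changes the structure value.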
A further consideration is the fact that an XML schema may or may not specify or require a particular order of elements. If a specific order is not required, the order of elements may vary between different instances of the same document class/schema, even though the documents are structurally identical in all other respects. In some embodiments, this variability is addressed by reordering elements to ensure that the elements in documents associated with a particular schema appear in the same order, and then calculating a structure value on the reordered document, e.g., for purposes of determining whether the structure is the same as a previously processed document, as in steps 204 and 206 of
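Reordering before computing the structure value can be sketched as below. Sorting sibling elements by tag name is an assumption made for this illustration; the disclosure only requires that elements end up in a consistent order. The function name `canonical_structure_value` is hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET


def canonical_structure_value(xml_text):
    """Hash the tag structure after sorting sibling elements by tag name."""
    digest = hashlib.sha256()

    def walk(elem):
        digest.update(f"<{elem.tag}>".encode())
        # sorted() is stable, so siblings with the same tag keep their order.
        for child in sorted(elem, key=lambda c: c.tag):
            walk(child)
        digest.update(f"</{elem.tag}>".encode())

    walk(ET.fromstring(xml_text))
    return digest.hexdigest()


type_first = "<toy><type>ball</type><color>red</color></toy>"
color_first = "<toy><color>blue</color><type>kite</type></toy>"
```

With this normalization the two documents above produce the same structure value despite listing their sub-elements in different orders.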
Finally, the number of at least certain elements may vary between valid instances of a single class of documents defined by the same schema. For example, in an organization chart document, the root element <orgchart> may be permitted to include one or more department sub-elements, e.g., <dept>, each of which may include one or more employee name sub-elements, e.g., <employeeName>. Two valid instances of such a class may have varying numbers of departments and/or different numbers of employees within one or more departments, which might result in their structure values determined as described above being different even though their structures are very similar. In some embodiments, the potential variation in the number of occurrences of an element is handled with respect to an element that must occur once but may occur more times by calculating the structural value using only the first occurrence of such an element and omitting subsequent occurrences from the structure value calculation. In some embodiments, an element that may not occur at all in a valid instance of a schema is omitted entirely from the structure value calculation, whether it occurs or not. This approach has the benefit of reducing the number of unique structures of which one must keep track. However, it complicates the task of quickly determining the location of data values in any particular instance of the class defined by the schema and associating validation rules with the data values. In some embodiments, the proliferation of unique structure values (or types) is tolerable and no attempt is made to associate documents of the same type but different numbers of elements together. Instead, each unique structure generates a unique structure value and associated data value location and validation rule information, and only those subsequently received documents that have the exact same structure (i.e., down to the number of occurrences of the various elements) are determined to have the same structure for purposes of performing accelerated validation.
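The first-occurrence variant can be sketched as follows for the organization chart example. The `REPEATABLE` set is assumed to be derived from the schema and, like the function name `collapsed_structure_value`, is hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET

REPEATABLE = {"dept", "employeeName"}  # assumed to come from the schema


def collapsed_structure_value(xml_text):
    """Hash the tag structure, counting only the first of any repeatable sibling."""
    digest = hashlib.sha256()

    def walk(elem):
        digest.update(f"<{elem.tag}>".encode())
        seen = set()
        for child in elem:
            if child.tag in REPEATABLE and child.tag in seen:
                continue  # later occurrence: left out of the structure value
            seen.add(child.tag)
            walk(child)
        digest.update(f"</{elem.tag}>".encode())

    walk(ET.fromstring(xml_text))
    return digest.hexdigest()


small = "<orgchart><dept><employeeName>Ann</employeeName></dept></orgchart>"
large = (
    "<orgchart><dept><employeeName>Ann</employeeName>"
    "<employeeName>Bob</employeeName></dept>"
    "<dept><employeeName>Cy</employeeName></dept></orgchart>"
)
```

The two documents collapse to the same structure value, but the skipped occurrences no longer have a distinct location key, which illustrates the difficulty noted above in associating validation rules with their data values.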
In some embodiments, the structure of a document type in which the number of times that one or more elements occur may vary is learned by modifying the process of
In some embodiments, special provisions are made to avoid having two documents being determined to have the same structure value, such that only the data values are validated, even if the structural portion of the document is not well formed. For example, in some embodiments a character “e” is added to the structure value calculation whenever an end tag is encountered (i.e., a tag in the form </tag>) to avoid having the following two documents being found to have the same structure value: <foo><bar>text</bar></foo> and <foo><bar>text</foo>. In some embodiments, provisions are similarly made for such variations as permissible white spaces, start tag and end tag pairs that do not include any data value, etc.
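The effect of the end-tag marker can be illustrated with a small sketch. It assumes, for the purpose of the example, a structure value built from start-tag names alone, and it uses a regular-expression tokenizer because the second document is not well formed and a conforming parser would reject it. The function and parameter names are hypothetical.

```python
import hashlib
import re

TAG = re.compile(r"<(/?)([^<>/\s]+)[^<>]*>")


def structure_value(xml_text, mark_end_tags=True):
    """Hash start-tag names, optionally adding "e" for every end tag."""
    digest = hashlib.sha256()
    for slash, name in TAG.findall(xml_text):
        if slash:
            if mark_end_tags:
                digest.update(b"e")  # marker for an end tag
        else:
            digest.update(f"<{name}>".encode())
    return digest.hexdigest()


well_formed = "<foo><bar>text</bar></foo>"
missing_end = "<foo><bar>text</foo>"
```

Without the marker the two documents hash identically; with it, the missing `</bar>` changes the structure value, so the malformed document is not mistaken for a previously validated one.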
The use of the XML meta-language format and XML associated definitions is merely an illustrative example. Other meta-language formats, other structured document formats, and other associated definitions may be used in one or more of the processes and systems described above.
Although the foregoing embodiments have been described in some detail for purposes of clarity of understanding, the invention is not limited to the details provided. There are many alternative ways of implementing the invention. The disclosed embodiments are illustrative and not restrictive.
This application claims priority to U.S. Provisional Patent Application No. 60/584,780 (Attorney Docket No. REACP003+) entitled ACCELERATED SCHEMA-BASED VALIDATION filed Jun. 30, 2004 which is incorporated herein by reference for all purposes.
Number | Date | Country
---|---|---
60584780 | Jun 2004 | US