The foregoing and other objects, features and advantages of the invention will be apparent from the following description of particular embodiments of the invention, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention.
The disclosed configurations depict a process of input specialization that begins with a program written against the abstract XML data model described above—or any suitable data model with the above-described characteristics, such as the XPath data model—and a set of input-specialized data structures, which may be derived from an XML type definition language, such as XML Schema or other suitable language. The process is not limited to any particular such abstract model, or any particular set of concrete data structures, provided that the abstract model conforms to the general description of the node relationships above (notably four-way inter-relationships, and QName lookup), and that the concrete input-specialized data structures conform to the corresponding general description above (notably unidirectional relationships, and implied structure and naming). In an exemplary configuration of the method, we will refer to the canonical example of an XSLT program (which uses the XPath data model), being specialized to a set of Java classes, derived from an XML Schema (such as those produced by the mappings of JAX-RPC).
In contrast, in configurations herein, given a set of input-specialized data structures, and a mechanism by which to build them from an input XML document, the first step in the specialization process is to produce an in-memory representation of the program (in XSLT also called a stylesheet), where the input is assumed to be a generic data structure such as the DOM, or any other structure that closely models the generic abstract model of the program. The program is represented with an abstract syntax tree (AST), where the functions (in XSLT these correspond to templates) all take one or more parameters of the generic node type, and contain a body which is the expression for the function's result in terms of its parameters. In XSLT, templates all take an implied parameter, which is the current node. In the AST, these implied parameters are made explicit. Furthermore, XSLT supports a calling convention, apply-templates, in which the template to be called is determined by comparing the current node to a match pattern associated with a whole set of templates. In the AST, this can be represented explicitly as a function in which the match patterns of the relevant templates are rewritten as Boolean-valued XPath expressions indicating whether the current node is matched. These expressions are evaluated in a conditional loop, whose branches contain explicit calls to their matched template. In languages other than XSLT, processing of similarly implied constructs will be performed to make the AST a simple, explicit program.
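By way of illustration only, the explicit form of the apply-templates convention described above might be sketched as follows. The `Node` class, the two templates and their match tests are hypothetical stand-ins introduced for this sketch; they are not part of the disclosed configurations, and the built-in XSLT template rules are omitted.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical generic, DOM-like node standing in for the abstract data model."""
    name: str
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None


def template_title(current: Node) -> str:
    # The implied "current node" of the template is now an explicit parameter.
    return f"<h1>{current.name}</h1>"


def template_para(current: Node) -> str:
    return f"<p>{current.name}</p>"


def matches_title(current: Node) -> bool:
    # Match pattern "title" rewritten as a Boolean-valued test on the current node.
    return current.name == "title"


def matches_para(current: Node) -> bool:
    return current.name == "para"


def apply_templates(current: Node) -> str:
    # Explicit dispatch: each branch calls the template whose pattern matched.
    if matches_title(current):
        return template_title(current)
    if matches_para(current):
        return template_para(current)
    return ""  # no template matched; built-in rules are omitted in this sketch
```

In this form the dispatch is an ordinary function over the generic node type, so it can be specialized like any other function in the AST.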
The process of program specialization begins at the entry point (or points) to the program 150. In XSLT, this is the initial invocation of the apply-templates function with the root of the document as the current node. Specialization begins at this call, by specifying that the root node is of the type corresponding to the document-root's representation in the input-specialized data structures 180.
Each call to an input-specializable function in the AST 162 is annotated with a new, input-specialized type signature, containing the input-specialized types 120 of each of the arguments 104. A complete copy of the called function F1, F2 is made for every unique calling signature, and the body expression of that function is recursively rewritten in terms of operations over the input-specialized data structures, at each step annotating the program with the calculated input-specialized type of each expression. When a call to another function is encountered, the input-specialized call signature is calculated, and the corresponding specialized copy of that function is queued for rewriting.
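One possible way to picture the per-signature copying and queuing described above is a simple worklist. The sketch below assumes a toy program in which every function takes a single node argument and calls other functions on named children; the `PROGRAM` and `TYPES` tables and all names in them are hypothetical.

```python
from collections import deque
from typing import Dict, List, Tuple

# Hypothetical generic program: function name -> list of (callee, child name).
# Each function takes one node argument and calls callees on a named child of it.
PROGRAM: Dict[str, List[Tuple[str, str]]] = {
    "main": [("render", "header"), ("render", "body")],
    "render": [("emit", "text")],
    "emit": [],
}

# Hypothetical input-specialized types: type name -> {child field: child type}.
TYPES: Dict[str, Dict[str, str]] = {
    "Doc": {"header": "Header", "body": "Body"},
    "Header": {"text": "Text"},
    "Body": {"text": "Text"},
    "Text": {},
}

Signature = Tuple[str, str]  # (function name, input-specialized argument type)


def specialize(entry: str, root_type: str) -> Dict[Signature, List[Signature]]:
    """Create one specialized copy per unique (function, argument type) signature."""
    specialized: Dict[Signature, List[Signature]] = {}
    queue = deque([(entry, root_type)])
    while queue:
        sig = queue.popleft()
        if sig in specialized:
            continue  # this calling signature already has a specialized copy
        func, arg_type = sig
        calls = []
        for callee, child in PROGRAM[func]:
            # The argument type of the call is the type of the named member.
            callee_sig = (callee, TYPES[arg_type][child])
            calls.append(callee_sig)
            queue.append(callee_sig)  # queue the callee's copy for rewriting
        specialized[sig] = calls
    return specialized
```

Here `render` is copied twice, once for `Header` and once for `Body`, while `emit` is copied only once because both callers pass a `Text`.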
For each specialized copy of a function, the value expression is recursively rewritten in terms of expressions that operate on the specialized types 120. For example, an expression which, in the original version A-C, accesses an input node's child relation 14-5 with a given QName will be rewritten in terms of operations which access the appropriately named child field of the input-specialized type 120. Similarly, the input-specialized type 120 of every expression is calculated with reference to the original expression, and the input-specialized types of its arguments. Thus, for example, the type of the above child expression is determined to be the type of the named member in the argument's input-specialized structure. This process is carried out recursively through the AST 162, such that the resulting copy of the function is composed only of operations over the input-specialized types 120.
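As a minimal sketch of this recursive rewriting, assume a toy expression language with only variables and named child steps. The expression classes, the `SCHEMA` table and the `rewrite` function below are hypothetical and serve only to show how each rewritten expression can carry its calculated input-specialized type.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class Var:
    """Generic expression: a variable of the generic node type."""
    name: str


@dataclass(frozen=True)
class Child:
    """Generic expression: the child of `base` with QName `qname`."""
    base: "Expr"
    qname: str


@dataclass(frozen=True)
class TypedVar:
    """Specialized expression: a variable annotated with its specialized type."""
    name: str
    type: str


@dataclass(frozen=True)
class Field:
    """Specialized expression: direct member access, annotated with its type."""
    base: "SpecExpr"
    member: str
    type: str


Expr = Union[Var, Child]
SpecExpr = Union[TypedVar, Field]

# Hypothetical input-specialized types: type name -> {member name: member type}.
SCHEMA: Dict[str, Dict[str, str]] = {
    "Order": {"customer": "Customer"},
    "Customer": {"name": "String"},
    "String": {},
}


def rewrite(expr: Expr, env: Dict[str, str]) -> SpecExpr:
    """Rewrite a generic expression bottom-up, annotating each step with its type."""
    if isinstance(expr, Var):
        return TypedVar(expr.name, env[expr.name])
    base = rewrite(expr.base, env)
    # The type of a child step is the type of the named member of the base type.
    return Field(base, expr.qname, SCHEMA[base.type][expr.qname])
```

For instance, rewriting the child step `name` of the child step `customer` of a variable typed `Order` yields nested `Field` accesses whose outermost type is `String`.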
The parser 170 processes the syntax tree 162 to identify function invocations F1 and F2 including data element references included in the input specialized data structures 180. The mapper 174 identifies the input specialized data structures A′, B′ and C′ (120) corresponding to the data element references A, B and C (104) from the application program 150. The signature generator 172 employs the mapped data elements A′, B′ and C′ to replace the function invocations F1 and F2 with the input specialized function references (signatures) F1′ and F2′ 192 in the output application program 190 including the input specialized calls 192. Accordingly, the input specialized data elements 194-1 through 194-3 are operable to access the corresponding data items 196-1 through 196-3 via a single offset indirection 198, thus avoiding an iteration of pointer references and name matching typically associated with DOM based references in an application program.
In the simplest content models, where the content is just a sequence of elements, named child expressions will reference one member of the input-specialized data structure. In more complicated cases, it may be necessary to reference several members. In this case, a more complex expression will be used to retrieve all of the relevant children, and gather them into a result set. These results might be encoded in a variety of ways, including lists or arrays, but also possibly tuples or even lambda expressions which, when evaluated, return the desired result—or, of course a combination of any of these representations. In particular, more complicated schemes, perhaps involving unions, or union-like structures, may be desirable when all of the result nodes are not of the same type. For configurations including XML Schema, all identically named children are restricted to be of the same type, and so in many cases, a simple list or array will suffice.
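For illustration, consider a hypothetical content model of the form (item, note?, item*), in which the identically named "item" children fall into two different members of the input-specialized type. The class and member names below are assumptions made for this sketch; a plain list is used as the result set, which is only one of the encodings mentioned above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    sku: str


@dataclass
class Order:
    # Hypothetical specialized type for the content model (item, note?, item*):
    # the "item" children are split across two members.
    first_item: Item
    note: str = ""
    more_items: List[Item] = field(default_factory=list)


def child_items(order: Order) -> List[Item]:
    # Rewritten form of the named child expression "item": gather every
    # relevant member, in document order, into a single result list.
    return [order.first_item, *order.more_items]
```

Because all identically named children share one type in this example, a simple list suffices and no union-like structure is needed.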
Rewriting of simple expressions involving child relations is straightforward; an expression which accesses a named child of an input node is rewritten to access the named member or members of that node's input-specialized type. However, in the case of the other relations (parent, next and previous), or extended relations derived from them (e.g. XPath's ancestor axis), the conversion may employ additional processing. Since the input-specialized types do not include accessors for these other relationships, support for such expressions must be achieved by saving references to parent nodes further up the expression tree, while references to those nodes are still in scope. In particular, this means that the actual type used for any node in the expression is not just the type stipulated by or derived from the calling context, but is, in fact, a collection of that node and any of its ancestor nodes which may be required by dependent expressions. This collection could be implemented in a variety of ways, for example, as a tuple or a list. Within a function, these dependencies are resolved while evaluating the expressions to determine their input-specialized types. For example, if the result of a particular expression is used in a subsequent expression that would require its parent (or more distant ancestor), then the type of that expression is augmented with the relevant parent/ancestor node to reflect the additional dependency. These dependencies are propagated up the input-specialized type annotations on the expression tree for the function during regular function specialization. For ancestor dependencies which cross function boundaries, the propagation is performed across the whole (potentially recursive) function call stack repeatedly until the full set of dependencies is resolved. As a result, all of the functions in the call-stack will be modified to prepare for such back-references.
For example, if a function takes as an argument a given node, and in its value expression, accesses its grandparent node, then an annotation is made on that function argument, stating that the node must be passed in with its two ancestors; furthermore, any variable in another function that supplies that variable is similarly annotated as needing its two ancestors to be remembered. If such a variable X is the result of a child step from yet another variable Y, then Y is annotated as needing only one of its ancestors, and so on. The process of “remembering” means that, whereas in the original code, a variable X might require a single value to be passed, the new code might require 2 or more values to be passed along in the X variable, depending on the number of ancestors that need to be remembered. Expressions for siblings are handled similarly, as that access is made via the parent node.
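The grandparent example above might look as follows once specialized, assuming the tuple representation of remembered ancestors. The `Book`, `Section` and `Line` types and the three functions are hypothetical; the point of the sketch is only that the specialized types carry child links and no parent links, so the ancestors travel alongside the node.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Line:
    text: str


@dataclass
class Section:
    title: str
    lines: List[Line]


@dataclass
class Book:
    title: str
    sections: List[Section]


# Variable X: a Line passed together with its two remembered ancestors.
LineWithAncestors = Tuple[Line, Section, Book]
# Variable Y: a Section passed together with its one remembered ancestor.
SectionWithAncestor = Tuple[Section, Book]


def label(x: LineWithAncestors) -> str:
    # The generic code would read the grandparent via parent links; here the
    # grandparent is simply the third component of the argument.
    line, section, book = x
    return f"{book.title} / {section.title}: {line.text}"


def render_section(y: SectionWithAncestor) -> List[str]:
    # X results from a child step on Y, so Y needs to remember only one ancestor.
    section, book = y
    return [label((line, section, book)) for line in section.lines]


def render_book(book: Book) -> List[str]:
    # The root has no ancestors to remember; it supplies itself as the ancestor.
    return [s for section in book.sections for s in render_section((section, book))]
```

A list could be substituted for the tuple where the number of required ancestors cannot be bounded.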
The choice of representation of ancestor nodes may vary according to the needs of the program. For example, if the input-specialized type system is recursive, it may not be possible to bound the number of ancestors required for a given function (especially if that function is also recursive). In such a case, the tuple representation may not be appropriate, and a list or other representation will be preferred. This does not present an insurmountable problem, however, since the recursion is easily detected during specialization analysis, and dealt with accordingly.
During expression rewriting, application of the input-specialized type system to the original, generically typed program may render some branches of the program unreachable. This can be a source of significant performance improvement, as the runtime check for those branches may be eliminated statically. A good example of how this operates can be seen in the implied apply-templates function of an XSLT program. Typical usage of apply-templates will select a particular named descendant of the current node, and apply templates on it. With the input-specialized type of that node (and thus its name) known, the number of template match expressions that can possibly evaluate to true is greatly reduced (since most of the templates will match on a distinct name). Indeed, in the most common usage, where there is only one match pattern that accepts the given named node, the specialized apply-templates function for that call will be optimized down to a direct call into a specific template, a so-called partial evaluation operation.
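The reduction of apply-templates to a direct call can be sketched as below, under the simplifying assumption that every match pattern is a plain element name. The template table and function names are hypothetical; real match patterns are richer, and this sketch simply lets the first matching template win.

```python
from typing import Callable, List, Tuple

Template = Callable[[str], str]


def t_title(value: str) -> str:
    return f"<h1>{value}</h1>"


def t_para(value: str) -> str:
    return f"<p>{value}</p>"


# Hypothetical template set: (match pattern as an element name, template).
TEMPLATES: List[Tuple[str, Template]] = [("title", t_title), ("para", t_para)]


def generic_apply_templates(name: str, value: str) -> str:
    # Generic form: every match test is evaluated at run time.
    for pattern, template in TEMPLATES:
        if name == pattern:
            return template(value)
    return ""


def specialize_apply_templates(static_name: str) -> Template:
    # Specialization time: the node's name is known from its input-specialized
    # type, so match tests that cannot succeed are eliminated statically.
    candidates = [t for pattern, t in TEMPLATES if pattern == static_name]
    if candidates:
        return candidates[0]  # the call site becomes a direct template call
    return lambda value: ""   # no template can match this node type
```

With the name fixed at "para", the specialized call site refers to `t_para` itself, and no name comparison remains at run time.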
Once all of the reachable functions in the program are specialized, the unused, original copies of the functions are removed from the AST. The result is a complete version of the program, rewritten to operate on the efficient, lightweight, input-specialized data structures. When execution code is generated for this AST, it is coupled with the deserializing parser described above, to produce a fully functional version of the program that leverages the superior memory and access characteristics of the specialized data structures to achieve significant performance improvement over the generic version. Thus an executable is automatically generated from the high-level dynamically typed source, which has performance and memory characteristics comparable to those of a low-level program written against task-specific data structures.
The parser 170 in the program specializer 160 parses the application program 150 to identify data element references 104 to data elements in the generated input specialized definitions of data elements 120, as shown at step 302. In the arrangement shown, parsing includes generating an abstract syntax tree 162 indicative of the references 104 to data elements, as depicted at step 303. Building the abstract syntax tree (AST) 162 includes generating a memory resident version of the application program 150 represented as a hierarchical tree structure (such as the AST 162), as shown at step 304. The AST or other memory resident structure identifies the data element references to be replaced with input specialized data element references 120.
The parser 170 traverses the syntax tree 162 representation of the application program, as depicted at step 305. During the traversal, the parser identifies DOM references including XSLT based XPath expressions, responsive to input specialization as defined herein. Such expressions are those replaceable by one or more of the input specialized data structures 120. The signature generator 172 computes an expression indicative of an implied parameter representing a current node, and the mapper 174 matches a function invocation by specifying a Boolean expression indicative of the current node, as depicted at step 306. Thus, the program specializer 160 traverses the hierarchical tree structure 162 to identify data element references 104 defining function F1, F2 parameters having a generic node type, as disclosed at step 307. The traversal therefore identifies function invocations 152 including the data element references 104, as depicted at step 308.
For each data element reference 104 traversed, a check is performed to identify if it is encompassed by a complementary input specialized data structure 120 in the input specialized data structures 180 generated previously, as shown at step 309. If so, then the signature generator 172 computes an input specialized definition F1′, F2′ corresponding to each of the identified data element references, as depicted at step 310. In the exemplary configuration, this includes determining an index for offset indirection, as shown at step 311, and thus further involves generating an input specialized definition 120 having offset references to members of the data element A-C, as disclosed at step 312, such that the data element A-C members are operable for indexed references 194 by the application program 190.
A check is performed, at step 313, to identify unused data members and/or attributes of the input specialized definition 120. As indicated above, the DOM based definitions tend to be over inclusive, and therefore may include elements unused in a particular arrangement. If unused members are found, then parsing invokes partial evaluation, which includes identifying unused attributes in the parsed application program and removing operations that involve the unused attributes, as depicted at step 314. Such removal eliminates code for retrieving and comparing names of node elements, as shown at step 315.
Another check is performed for references to ancestor nodes corresponding to parent node traversals, as shown at step 316. As indicated above, the program specializer operates on a unidirectionally linked structure that may be linked only in the child node direction. Accordingly, such parsing further includes identifying ancestor references to data elements 104, in which the ancestor reference has unidirectional relations opposed to the relations in the input specialized definition (i.e. attempting to get a parent in a child-only linking), and computing a previous invocation to the ancestor reference. The parser 170 employs a computed previous invocation for replacing the ancestor reference, as depicted at step 317. In other words, at some point in the traversal, the now sought parent node has been referenced, at which point the location is stored for future ancestor references.
Having identified appropriate references for input specialization, the parser 170 annotates the identified invocations with a signature indicative of a set of input specialized definitions 120, each of the input specialized definitions 120 corresponding to a markup based argument A-C of a function invocation F1, F2, as shown at step 318. This annotation includes replacing the identified data element references 104 with the corresponding input specialized definition 120, as depicted at step 319. In particular instances, the data element reference 104 may be a child reference to an attribute, and replacing further includes replacing with a named child expression indicative of the type and name of the attribute, as depicted at step 320. Such a named child attribute is indicative of type and name by virtue of the location, or offset, in the reference, rather than requiring a traversal and name matching. The data element references may define markup language elements in parameters to function invocations, in which replacing further includes substituting an offset based expression for a pointer traversal operation, as shown at step 321. Therefore, such replacing or rewriting involves replacing element references with a single deterministic reference 194 indicative of the data element 196, as depicted at step 322, such that the single deterministic reference 194 avoids multiple pointer traversals, i.e. is an offset reference, rather than a pointer to a more complex pointer structure with multiple levels of indirection and node matching.
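The difference between a traversal with name matching and a single offset based reference can be illustrated as follows. This sketch assumes that the specialized record is a flat tuple and that member offsets are fixed at specialization time; the data, the `FIELD_OFFSETS` table and both accessor functions are hypothetical.

```python
from typing import List, Optional, Tuple

# Generic form: children stored as (name, value) pairs, searched by name.
GENERIC_CHILDREN: List[Tuple[str, str]] = [
    ("id", "42"),
    ("name", "Ada"),
    ("email", "ada@example.org"),
]


def generic_child(children: List[Tuple[str, str]], qname: str) -> Optional[str]:
    # Traversal with name matching: every access walks the child list.
    for name, value in children:
        if name == qname:
            return value
    return None


# Specialized form: member offsets are resolved once, at specialization time.
FIELD_OFFSETS = {"id": 0, "name": 1, "email": 2}
NAME_OFFSET = FIELD_OFFSETS["name"]

SPECIALIZED_RECORD: Tuple[str, str, str] = ("42", "Ada", "ada@example.org")


def specialized_name(record: Tuple[str, str, str]) -> str:
    # Single deterministic reference: one indexed access, no name comparison.
    return record[NAME_OFFSET]
```

The name and type of the member are implied by its position in the record, so neither is stored or compared at run time.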
The parser 170 continues traversing to generate a signature for each function invocation 152, such that each signature is indicative of input specialized parameters 120 appropriate for the function invocation, as shown at step 323. Upon completion, the program specializer 160 generates an input specialized program 190 having input specialized references 194 to input specialized data structures 196, as depicted at step 324.
The disclosed configurations may result in large amounts of new code, some parts of which are repetitive, some parts of which have dangling references, and many parts of which can be optimized. Configurations herein optimize this code using partial evaluation in order to bring the code size back down to the approximate size it was prior to input specialization.
Those skilled in the art should readily appreciate that the programs and methods for processing markup data using an input specialized data structure as defined herein are deliverable to a processing device in many forms, including but not limited to a) information permanently stored on non-writeable storage media such as ROM devices, b) information alterably stored on writeable storage media such as floppy disks, magnetic tapes, CDs, RAM devices, and other magnetic and optical media, or c) information conveyed to a computer through communication media, for example using baseband signaling or broadband signaling techniques, as in an electronic network such as the Internet or telephone modem lines. The disclosed method may be in the form of an encoded set of processor based instructions for performing the operations and methods discussed above. Such delivery may be in the form of a computer program product having a computer readable medium operable to store computer program logic embodied in computer program code encoded thereon, for example. The operations and methods may be implemented in a software executable object or as a set of instructions embedded in a carrier wave. Alternatively, the operations and methods disclosed herein may be embodied in whole or in part using hardware components, such as Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), state machines, controllers or other hardware components or devices, or a combination of hardware, software, and firmware components.
While the system and method for processing markup data using an input specialized data structure has been particularly shown and described with references to embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the invention encompassed by the appended claims.