This application claims priority under 35 U.S.C. §119 to Chinese Patent Application 201210126487.9, filed on Apr. 26, 2012, titled “PARTITION BASED STRUCTURED DOCUMENT TRANSFORMATION”, which is incorporated herein by reference in its entirety.
Embodiments generally relate to computer systems, and more particularly to methods and systems for transforming structured documents.
Several reporting software applications, such as SAP® Business One, transform business data, received in the form of a structured document, into another format according to the requirement of the user. In many cases, instructions may be provided for transforming the structured document into another format. For example, an Extensible Stylesheet Language Transformations (XSLT) stylesheet may be defined to translate an Extensible Markup Language (XML) document, which is a structured document, into another structured or unstructured document form (such as plain text, word processor, spreadsheet, database, PDF, HTML, etc.).
Typically, for transforming an XML document, an XSLT processor builds a Document Object Model (DOM) tree, which has a node corresponding to each element of the XML document. The XSLT processor then performs the transformation operation on the created DOM tree. A DOM tree consumes memory proportional to the size of the XML document. Therefore, if the XML document is larger than the available system memory, the transformation process may throw an out-of-memory exception.
The claims set forth the embodiments of the invention with particularity. The invention is illustrated by way of example and not by way of limitation in the figures of the accompanying drawings in which like references indicate similar elements. The embodiments of the invention, together with its advantages, may be best understood from the following detailed description taken in conjunction with the accompanying drawings.
Embodiments of techniques for partition based structured document transformation are described herein. In the following description, numerous specific details are set forth to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention can be practiced without one or more of the specific details, or with other methods, components, materials, etc. In other instances, well-known structures, materials, or operations are not shown or described in detail to avoid obscuring aspects of the invention.
Reference throughout this specification to “one embodiment”, “this embodiment” and similar phrases, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of these phrases in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
The source XML document, shown in Table 1, includes nodes <school>, <class>, <student>, <address> and <street>. The node <student> has values James and Michael and the node <street> has value XYZ. The transformation operation may be performed on the source structured document 102 to obtain a target document 104 according to the requirement of the user. For transforming the source structured document 102, initially a divide operation 106 is performed on the source structured document 102 to obtain several portions 108 of the source structured document 102. The source structured document 102 may be divided based on the nodes in the source structured document 102. As shown, the source structured document 102 is divided into a portion 1 110 and a portion 2 112 of the source structured document 102. In the above example, the source structured document, in Table 1, may be divided according to the <class> node and the <address> node to obtain two portions of the source structured document, shown in Table 2 and Table 3, respectively.
Next, a transform operation 114 may be performed on the portions 108 of the source structured document 102 to obtain portions 116 of the target document 104. The transform operation 114 may be performed in several steps; at each step, one of the portions 108 of the source structured document 102 may be transformed to obtain one of the portions 116 of the target document 104. For example, a transform operation 114 may be performed on the portion 1 110 of the source structured document 102 to obtain a portion 1 118 of the target document 104. Next, a transform operation 114 may be performed on the portion 2 112 of the source structured document 102 to obtain a portion 2 120 of the target document 104. In one embodiment, transformation rules are defined to transform the portions 108 of the source structured document 102 to obtain the portions 116 of the target document 104. In the above example, transformation rules may be defined to transform the <student> node, included in the portion 1 of the source structured document, shown in Table 2, to a “FOUND A LEARNER” output, and the <street> node, included in the portion 2 of the source structured document, shown in Table 3, to an output containing the value of the <street> node (XYZ). Using the transformation rules, the portion 1 of the source structured document, shown in Table 2, may be first transformed to obtain a portion 1 of the target document, shown in Table 4, which includes:
Next, the transformation rules may be applied to the portion 2 of the source structured document, shown in Table 3, to obtain a portion 2 of the target document, shown in Table 5, which includes:
Finally, a generate operation 122 is performed to generate the target document 104 using the obtained portions 116 of the target document 104. The target document 104 may be generated by combining the obtained portion 1 118 and the obtained portion 2 120 of the target document 104. In the above example, the portion 1, shown in Table 4, and the portion 2 of the target document, shown in Table 5, are combined to obtain the target document, shown in Table 6, which includes:
In the above example, <business data> is a root node, <companies>, <company 1>, <company 2> and <customer> are branch nodes, and <company name>, <location>, and <customer name> are leaf nodes. The node <company name> has values HAPPY COMPANY and WINNER INC., the node <location> has values SUNNYVALE and NEW JERSEY, and the node <customer name> has value BIG CUSTOMER.
Next, at block 204, nodes in the source structured document are selected for partitioning the source structured document. A user may select one or more nodes from the source structured document, based on which the user wants to divide the source structured document. A display tool may be provided which displays the source structured document to the user. The user may select one or more nodes from the displayed source structured document. In one embodiment, the system may allow the user to select only some of the nodes in the structured document for dividing the source structured document. For example, the user may be allowed to select only the root node and the branch nodes but not the leaf nodes. In the above example, the source structured document may be presented to the user. The user may be allowed to select only the root node <business data> and the branch nodes <companies>, <company 1>, <company 2>, and <customer>. Assume that the user selects the branch nodes <companies>, <company 1> and <company 2>.
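The restriction to root and branch nodes can be illustrated with a minimal Python sketch. The element names below are adapted from the example (spaces are replaced with underscores or removed so that the names are valid XML), and the nesting of <customer> directly under the root is an assumption, as is the helper name selectable_nodes.

```python
import xml.etree.ElementTree as ET

# Hypothetical rendering of the example document; names adapted to valid XML.
SOURCE = """
<business_data>
  <companies>
    <company1><company_name>HAPPY COMPANY</company_name><location>SUNNYVALE</location></company1>
    <company2><company_name>WINNER INC.</company_name><location>NEW JERSEY</location></company2>
  </companies>
  <customer><customer_name>BIG CUSTOMER</customer_name></customer>
</business_data>
"""

def selectable_nodes(root):
    """Return the tags of the root and branch nodes, i.e. elements with children.

    Leaf nodes (elements without child elements) are excluded, so a user
    could not pick them as partition points.
    """
    return [elem.tag for elem in root.iter() if len(elem) > 0]

if __name__ == "__main__":
    print(selectable_nodes(ET.fromstring(SOURCE)))
```

In this sketch a display tool would offer only the returned tags for selection.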
The selected nodes may be defined as placeholders for the source structured document. A placeholder is a node that refers to a portion of the source structured document. In one embodiment, the placeholder is a node that indirectly refers to a portion of the source structured document. In this case, the placeholder may refer to another placeholder that refers to a portion of the source structured document. In the above example, the selected nodes <companies>, <company 1>, and <company 2> are defined as placeholders of the source structured document. The placeholder <company 1> and the placeholder <company 2> may refer to a portion 1 and a portion 2, respectively, of the source structured document shown in Table 8 and Table 9, respectively.
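The difference between a placeholder that refers directly to a portion and one that refers to it indirectly can be sketched as a small data structure. This is an illustrative Python sketch only; the class name Placeholder, its fields, and the function resolve are hypothetical and are not taken from the described embodiments.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Placeholder:
    """A node that refers to a portion directly, or indirectly via other placeholders."""
    name: str
    portion: Optional[str] = None                 # direct reference to a portion
    children: List["Placeholder"] = field(default_factory=list)  # indirect references

def resolve(placeholder):
    """Return every portion reachable from a placeholder, in order."""
    if placeholder.portion is not None:
        return [placeholder.portion]
    portions = []
    for child in placeholder.children:
        portions.extend(resolve(child))
    return portions

# <company 1> and <company 2> refer directly to portions;
# <companies> refers to them only through those two placeholders.
company_1 = Placeholder("company 1", portion="portion 1")
company_2 = Placeholder("company 2", portion="portion 2")
companies = Placeholder("companies", children=[company_1, company_2])

if __name__ == "__main__":
    print(resolve(companies))
```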
The placeholder <companies> indirectly refers to the portion 1 and the portion 2 of the source structured document; that is, the placeholder <companies> refers to the placeholder <company 1> and the placeholder <company 2>, which refer to the portion 1 and the portion 2, respectively, of the source structured document.
Next, at block 206, the source structured document may be divided based on the portions of the source structured document referred to by the defined placeholders. Dividing the source structured document may include storing each of the portions of the source structured document, referred to by the defined placeholders, as separate structured document files. The placeholder may refer to the structured document file storing the portion of the source structured document. In the above example, the source structured document is divided into two portions, a portion 1 referred to by the placeholder <company 1> and a portion 2 referred to by the placeholder <company 2>. A first structured document file (file 1.xml) and a second structured document file (file 2.xml) storing the portion 1 and the portion 2, respectively, of the source structured document may be created. The placeholder <company 1> and the placeholder <company 2> may refer to the first structured document file (file 1.xml) and the second structured document file (file 2.xml), respectively.
Next, at block 208, a source structured document index file is created, which stores the defined placeholders, included in the source structured document, and a remaining portion of the source structured document, which is not referred to by any of the placeholders. In the above example, the remaining portion of the source structured document, shown in Table 10, which is not referred to by the placeholders <companies>, <company 1>, and <company 2>, includes:
The obtained source structured document index file, shown in Table 11, includes:
In one embodiment, the source structured document index file may be directly created based on the source data, such as business data. In this case, the user selection may be received to define placeholders that refer to different portions of the source data. The source structured document index file in this case may include the defined placeholders and the remaining portion of the source data, which is not referred to by any of the placeholders.
Next, at block 210, the source structured document index file obtained at block 208 is transformed to obtain an interim result. Transformation is a process of applying transformation rules to an input structured document to obtain an output target document. In one embodiment, the target document may be in a structured or unstructured document form (such as plain text, word processor, spreadsheet, database, PDF, HTML, etc.). For example, the source structured document may be in XML and the target document may be in XML or HTML, or any other format depending on the requirement of the user. The transformation operation may be performed using transformation files, which include transformation rules for transforming the source structured document to the target document. For example, an XML source document may be transformed into an XML target document using an Extensible Stylesheet Language Transformations (XSLT) file. The XSLT transformation may be performed by an XSLT processor, which takes as input an XML source document and an XSLT stylesheet and produces the target document. The XSLT stylesheet contains a collection of transformation rules, which are instructions and other directives that guide the processor in the production of the target document. The XSLT stylesheet may include transformation rules corresponding to the different nodes in the source structured document.
The XSLT processor may perform the transformation by matching the nodes in the source structured document with the transformation rules, in the XSLT stylesheet, and applying the corresponding transformation rules to the node. The system may store several transformation files for performing different transformations. For example, different transformation files may be stored in the system to transform the source structured document index file, and the portions of the source structured document referred by the placeholders.
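The matching step described above can be illustrated without an XSLT processor. The following Python sketch is not XSLT and does not reproduce any transformation file of the embodiments; it only mimics the idea of matching each node against a set of rules and applying the rule that matches. The function name apply_rules, the rule table, and the element names are hypothetical.

```python
import xml.etree.ElementTree as ET

def apply_rules(root, rules):
    """Walk the tree in document order and apply the rule matching each node's tag.

    `rules` maps a tag name to a function that turns a matching element
    into a piece of output. Nodes without a matching rule produce nothing.
    """
    output = []
    for elem in root.iter():
        rule = rules.get(elem.tag)
        if rule is not None:
            output.append(rule(elem))
    return output

# One rule, comparable in spirit to a template that matches a node and emits its value.
RULES = {"customer_name": lambda elem: elem.text}

if __name__ == "__main__":
    doc = ET.fromstring(
        "<customer><customer_name>BIG CUSTOMER</customer_name></customer>"
    )
    print(apply_rules(doc, RULES))
```

Keeping the rules in a separate table mirrors the idea that several transformation files may be stored and chosen per input.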
For transforming the source structured document index file, the transformation rules may be applied on the source structured document index file to transform the remaining portion of the source structured document, included in the source structured document index file, to obtain a remaining portion of the target document. An interim result, which includes the remaining portion of the target document and the placeholders in the source structured document index file, may be obtained after the transformation operation on the source structured document index file. In the above example, a source structured document index transformation file for transforming the source structured document index file may include:
As shown in Table 12, the source structured document index transformation file includes the transformation rule <xsl:template match>, which checks whether a node of the source structured document is a “customer name” node. In case the node is a “customer name” node, the transformation rule <xsl:value-of select=“.”/>, included in the source structured document index transformation file, extracts the value of the node from the source structured document and places the extracted node value in the output structured document (the interim result, in this case). In the above example, the value of the <customer name> node, BIG CUSTOMER, is placed in the interim result obtained after transforming the source structured document index file. The interim result obtained after applying the transformation rule to the source structured document index file, shown in Table 13, includes:
As shown in Table 13, the interim result includes the transformation result of the remaining portion of the source structured document (the remaining portion of the target document) and the placeholders (<companies>, <company 1>, and <company 2>) defined in the source structured document.
Next, at block 212, the interim result is traversed to identify the placeholders in the interim result. In the above example, the interim result is traversed to identify three placeholders <companies>, <company 1>, and <company 2>. Next, at block 214, the portions of the source structured document referred to by the placeholders, included in the interim result, are retrieved for transforming these portions of the source structured document (block 216). In one embodiment, the structured document files storing the portions of the source structured document referred to by the placeholders, in the interim result file, are retrieved one by one for performing the transformation operation. The portions of the source structured document may be transformed using the transformation rules, included in the transformation files, to obtain portions of the target document. The transformation operation may be performed in several steps; at each step, one of the portions of the source structured document, referred to by the placeholders, may be loaded in a memory of the system for performing a transformation operation on the portion of the source structured document. The transformation operation may be repeated until all the portions of the source structured document, referred to by the placeholders, are transformed to obtain the portions of the target document. For example, suppose that the interim result file includes three placeholders referring to three different portions of the source structured document. The first portion of the source structured document referred to by the first placeholder in the interim result may be retrieved and loaded in the memory.
The transformation operation may be applied on the first portion of the source structured document to obtain a first portion of the target document. After obtaining the first portion of the target document, the second portion of the source structured document referred by the second placeholder in the interim result may be retrieved for transforming the second portion of the source structured document to a second portion of the target document. Finally after obtaining the second portion of the target document, the third portion of the source structured document referred by the third placeholder in the interim result file may be retrieved for transforming the third portion of the source structured document.
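The one-portion-at-a-time loop of blocks 212 to 216 can be sketched as follows. This Python sketch rests on several assumptions that are not part of the described embodiments: the interim result is modeled as an element tree whose placeholders are <placeholder ref="..."> elements, the portion files are written to a temporary directory, and the rule in transform_portion (emitting the company name followed by its location) is invented purely for illustration. Element and file names are adapted to valid XML and file names.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def transform_portion(portion):
    """Hypothetical rule: emit 'name (location)' for one company portion."""
    return f"{portion.findtext('company_name')} ({portion.findtext('location')})"

def transform_referenced_portions(interim):
    """Visit placeholders in document order; load and transform one portion at a time."""
    outputs = []
    for placeholder in interim.iter("placeholder"):
        # Only the portion referred to by the current placeholder is parsed here.
        portion = ET.parse(placeholder.get("ref")).getroot()
        outputs.append(transform_portion(portion))
        del portion  # drop the reference before the next portion is loaded
    return outputs

def demo():
    """Write two portion files, build an interim result, and transform the portions."""
    portions = {
        "file1.xml": "<company1><company_name>HAPPY COMPANY</company_name>"
                     "<location>SUNNYVALE</location></company1>",
        "file2.xml": "<company2><company_name>WINNER INC.</company_name>"
                     "<location>NEW JERSEY</location></company2>",
    }
    with tempfile.TemporaryDirectory() as workdir:
        interim = ET.Element("interim")
        for filename, xml_text in portions.items():
            path = os.path.join(workdir, filename)
            with open(path, "w", encoding="utf-8") as handle:
                handle.write(xml_text)
            ET.SubElement(interim, "placeholder", {"ref": path})
        return transform_referenced_portions(interim)

if __name__ == "__main__":
    print(demo())
```

Because each portion is parsed inside the loop and released before the next iteration, the tree held in memory at any time corresponds to a single portion rather than to the whole source document.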
Loading the portions referred to by the placeholders one by one in the memory, for performing the transformation operation, ensures that the memory consumed depends on the complexity of transforming each portion of the source structured document and not on the size of the entire source structured document. As discussed above, different transformation files may be stored in the system for transforming different portions of the source structured document. In the above example, as the portion 1 and the portion 2 of the source structured document referred to by the placeholders <company 1> and <company 2>, respectively, include similar elements, a single transformation file may be used for transforming the portion 1 and the portion 2 of the source structured document. The single transformation file, shown in Table 14, may include:
The first portion of the source structured document, shown in Table 8, may be loaded in the memory and the transformation rules in the transformation file may be applied on the first portion of the source structured document to obtain a first portion of the target document, shown in Table 15, which includes:
After obtaining the first portion of the target document, the second portion of the source structured document, shown in Table 9, may be loaded in the memory and the transformation rules in the single transformation file, shown in Table 14, may be applied on the second portion of the source structured document to obtain a second portion of the target document, shown in Table 16, which includes:
Finally at block 218, a target document is generated based on the portions of the target document obtained at block 216 and the interim result obtained at block 210. In one embodiment, the target document may be obtained by combining the portions of the target document obtained at block 216 and the remaining portion of the target document in the interim result obtained at block 210. The target document may be generated by replacing the portion of the source structured document with the corresponding portions of the target document. In the above example, the first portion, the second portion and the remaining portion of the source structured document are replaced by the first portion of the target document shown in Table 15, the second portion of the target document shown in Table 16, and the remaining portion of the target document included in the interim result shown in Table 13, to obtain the target document, shown in Table 17, which includes:
Some embodiments of the invention may include the above-described methods being written as one or more software components. These components, and the functionality associated with each, may be used by client, server, distributed, or peer computer systems. These components may be written in a computer language corresponding to one or more programming languages such as functional, declarative, procedural, object-oriented, lower level languages and the like. They may be linked to other components via various application programming interfaces and then compiled into one complete application for a server or a client. Alternatively, the components may be implemented in server and client applications. Further, these components may be linked together via various distributed programming protocols. Some example embodiments of the invention may include remote procedure calls or web services being used to implement one or more of these components across a distributed programming environment. For example, a logic level may reside on a first computer system that is remotely located from a second computer system containing an interface level (e.g., a graphical user interface). These first and second computer systems can be configured in a server-client, peer-to-peer, or some other configuration. The clients can vary in complexity from mobile and handheld devices, to thin clients and on to thick clients or even other servers.
The above-illustrated software components are tangibly stored on a computer readable storage medium as instructions. The term “computer readable storage medium” should be taken to include a single medium or multiple media that stores one or more sets of instructions. The term “computer readable storage medium” should be taken to include any physical article that is capable of undergoing a set of physical changes to physically store, encode, or otherwise carry a set of instructions for execution by a computer system which causes the computer system to perform any of the methods or process steps described, represented, or illustrated herein. Examples of computer readable storage media include, but are not limited to: magnetic media, such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs, DVDs and holographic devices; magneto-optical media; and hardware devices that are specially configured to store and execute, such as application-specific integrated circuits (“ASICs”), programmable logic devices (“PLDs”) and ROM and RAM devices. Examples of computer readable instructions include machine code, such as produced by a compiler, and files containing higher-level code that are executed by a computer using an interpreter. For example, an embodiment of the invention may be implemented using Java, C++, or other object-oriented programming language and development tools. Another embodiment of the invention may be implemented in hard-wired circuitry in place of, or in combination with machine readable software instructions.
A data source is an information resource. Data sources include sources of data that enable data storage and retrieval. Data sources may include databases, such as, relational, transactional, hierarchical, multi-dimensional (e.g., OLAP), object oriented databases, and the like. Further data sources include tabular data (e.g., spreadsheets, delimited text files), data tagged with a markup language (e.g., XML data), transactional data, unstructured data (e.g., text files, screen scrapings), hierarchical data (e.g., data in a file system, XML data), files, a plurality of reports, and any other data source accessible through an established protocol, such as, Open DataBase Connectivity (ODBC), produced by an underlying software system (e.g., ERP system), and the like. Data sources may also include a data source where the data is not tangibly stored or otherwise ephemeral such as data streams, broadcast data, and the like. These data sources can include associated data foundations, semantic layers, management systems, security systems and so on.
In the above description, numerous specific details are set forth to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention can be practiced without one or more of the specific details or with other methods, components, techniques, etc. In other instances, well-known operations or structures are not shown or described in detail to avoid obscuring aspects of the invention.
Although the processes illustrated and described herein include series of steps, it will be appreciated that the different embodiments of the present invention are not limited by the illustrated ordering of steps, as some steps may occur in different orders, some concurrently with other steps apart from that shown and described herein. In addition, not all illustrated steps may be required to implement a methodology in accordance with the present invention. Moreover, it will be appreciated that the processes may be implemented in association with the apparatus and systems illustrated and described herein as well as in association with other systems not illustrated.
The above descriptions and illustrations of embodiments of the invention, including what is described in the Abstract, are not intended to be exhaustive or to limit the invention to the precise forms disclosed. While specific embodiments of, and examples for, the invention are described herein for illustrative purposes, various equivalent modifications are possible within the scope of the invention, as those skilled in the relevant art will recognize. These modifications can be made to the invention in light of the above detailed description. Rather, the scope of the invention is to be determined by the following claims, which are to be interpreted in accordance with established doctrines of claim construction.
Number | Date | Country | Kind |
---|---|---|---|
201210126487.9 | Apr 2012 | CN | national |